4. Web Automation in Action: Mastering Amazon Data Collection

You’ve already laid a solid foundation in workflow logic, understanding web elements, and chaining commands for seamless data collection. Now, let’s dive into a hands-on project that turns that theory into action: automating Amazon product data collection with RPA. I’ll walk you through key techniques like looping through similar elements and using relative positioning—skills that will help you spot market demands, analyze competitors, and refine your processes. Whether you’re crunching stats or doing market research, this project will turn your RPA knowledge into a real competitive edge.

Case Background

In cross-border e-commerce and market research, Amazon product data—names, prices, ratings, and more—is make-or-break for smart product selection and competitor analysis. But manual work? It’s a nightmare: visiting Amazon, searching keywords, and copying data page by page is not only time-consuming but also full of errors and omissions.

With Octoparse AI, we’ll automate the entire workflow—from searching and multi-dimensional data collection to Excel exports—solving those manual pain points and turning your skills into actionable results.

Workflow Logic: From Start to Data Extraction

First off, sketch a flowchart before building the application. I always recommend this simple step because it sharpens your logic and uncovers potential gaps early on. For this Amazon project, visualize steps like inputting parameters (keywords, target pages), opening web elements, looping through data, storing results, and returning to the product list page. It keeps you from missing critical steps and prevents mid-build rework—trust me, this habit will save you hours on any project.

Building Your Workflow: A Step-by-Step Guide

Breaking Down the Logic: Key Automation Steps

Phase 1: Access and Search Navigation

Goal: Establish a connection and initiate the search.

Action: Start your workflow by using the Go to Web Page command to open Amazon in the built-in browser. To simulate a real user, use the Fill Text Field action to input your target keywords and click the search button. Once the results load, save the current screen as a "listPage" object. To prepare for the data you’re about to collect, initialize a data table variable (e.g., "datasheet") with seven columns to act as your central storage.

Phase 2: Multi-Page Pagination

Goal: Achieve continuous collection of multi-page products

Action: Use "Create variable" to create a numeric variable, initial_page_number, with an initial value of 1. Then call the "Loop by condition" command and set the loop condition to initial_page_number ≤ pageNumber, where pageNumber is the number of pages you want to collect. After collecting each page, use "Set variable" to add 1 to initial_page_number. This advances the page counter and keeps the collection going until the target page count is reached.

Note: This loop is the core of multi-page collection. By checking and updating the page number variable on every pass, it automatically controls the range of pages to be collected.
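The counter logic described above can be sketched in Python. This is only an illustration of the loop condition, not Octoparse AI's implementation; collect_pages and fake_collect_page are hypothetical names, and fake_collect_page stands in for the per-page collection steps.

```python
def collect_pages(page_number, collect_page):
    """Run collect_page for pages 1..page_number, like "Loop by condition"."""
    initial_page_number = 1                      # "Create variable"
    collected = []
    while initial_page_number <= page_number:    # the loop condition
        collected.extend(collect_page(initial_page_number))
        initial_page_number += 1                 # "Set variable": add 1
    return collected


# Hypothetical stand-in for the steps that collect one page of products.
def fake_collect_page(page):
    return [f"page {page} item {i}" for i in range(1, 3)]


rows = collect_pages(3, fake_collect_page)
print(len(rows))
```

If the "Set variable" step is left out, the condition never becomes false and the loop never ends, which is why the increment belongs inside the loop body.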

Related command: Loop by condition

Phase 3: Building Element Loops

Goal: Batch traverse all products on a single page

Action: To batch process products on a page, call "Loop through similar web elements" on listPage. Target the product cards and store each one as productElement.

Note: The product cards all share the same interface structure, so "Loop through similar web elements" can traverse every product on a single page in one pass, replacing the repetitive work of clicking products one by one.
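To see why a consistent card structure makes batch traversal possible, here is a minimal Python sketch using the standard library's ElementTree. The markup and the class name "card" are invented for illustration; real Amazon pages are far more complex, and Octoparse AI performs this matching for you.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified list page.
LIST_PAGE = """
<div id="results">
  <div class="card"><h2>Desk Lamp</h2></div>
  <div class="banner">Sponsored</div>
  <div class="card"><h2>USB Hub</h2></div>
  <div class="card"><h2>Notebook</h2></div>
</div>
""".strip()

list_page = ET.fromstring(LIST_PAGE)

names = []
# One expression matches every element that shares the card structure.
for product_element in list_page.findall(".//div[@class='card']"):
    names.append(product_element.find("h2").text)

print(names)
```

The banner is skipped because it does not match the card pattern, so only the three products are visited.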

Related command: Loop through similar web elements

Phase 4: Relative XPath element positioning

Goal: Accurately locate sub-information within products (such as name, price, etc.)

Action: Call the "Get relative element on web page" command and use relative XPath to locate child elements (like product name, price) based on "productElement" (the product card object).

Note: Relative XPath locates elements relative to the product card (the parent element), not by an absolute path from the top of the page. Even if the overall page structure is adjusted slightly, positioning stays stable as long as the internal structure of the product card is unchanged. This avoids the usual problem with absolute XPath, which "becomes invalid as soon as the page changes."
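The difference between relative and absolute paths can be shown with a small Python sketch. It places the same hypothetical product card in two invented page layouts and uses ElementTree's limited XPath support; the markup, class names, and ABSOLUTE_PRICE_PATH are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# The same hypothetical card, embedded in two different page layouts.
CARD = '<div class="card"><h2>Desk Lamp</h2><span class="price">19.99</span></div>'
OLD_LAYOUT = f"<body><div>{CARD}</div></body>"
NEW_LAYOUT = f"<body><main><section>{CARD}</section></main></body>"


def read_card(page_xml):
    page = ET.fromstring(page_xml)
    product_element = page.find(".//div[@class='card']")
    # These paths start at the card, not at the page root.
    name = product_element.findtext("h2")
    price = product_element.findtext("span[@class='price']")
    return name, price


# An absolute-style path that only fits the old layout.
ABSOLUTE_PRICE_PATH = "./div/div/span"
old_page = ET.fromstring(OLD_LAYOUT)
new_page = ET.fromstring(NEW_LAYOUT)

print(read_card(OLD_LAYOUT), read_card(NEW_LAYOUT))
print(old_page.find(ABSOLUTE_PRICE_PATH) is not None)
print(new_page.find(ABSOLUTE_PRICE_PATH) is not None)
```

The relative lookups return the same name and price in both layouts, while the absolute-style path finds the price only in the old one.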

Related commands:

Get relative element on web page
If the steps above feel a bit difficult, don't worry. There is also a simpler, efficient method: Extract data from web page

Phase 5: Using If Conditions for Dynamic Logic

Goal: Avoid process errors caused by empty elements

Action: To keep the process running smoothly, add an If check for empty elements, such as a missing product name. If the text is blank, execute "Next loop" to skip that item and move on to the next product.

Note: Some products may have missing information (such as no price). This check is the key to exception handling: it lets the process skip invalid products automatically instead of being interrupted by empty elements.
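In Python terms, the If check plus "Next loop" behaves like a guard followed by continue. The sketch below uses made-up product records; keeping a product whose price is missing (with an empty cell) is an assumption of this example, and you could just as well skip it with a second check.

```python
# Hypothetical scraped records; real data comes from the element loop.
products = [
    {"name": "Desk Lamp", "price": "19.99"},
    {"name": "", "price": "5.00"},        # missing name
    {"name": "USB Hub", "price": None},   # missing price
]


def collect_valid(items):
    kept = []
    for item in items:
        name = (item.get("name") or "").strip()
        if not name:          # "If": the name text is empty
            continue          # "Next loop": skip this product
        # Assumption: a missing price is stored as an empty cell.
        kept.append({"name": name, "price": item.get("price") or ""})
    return kept


valid = collect_valid(products)
print([row["name"] for row in valid])
```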

Related command: If

Phase 6: Data table storage

Goal: Store collected data

Action: Each time you complete product data collection within the loop, immediately write all the collected information to the data table using "Write to row." After that, navigate back to the product list page (listPage).

Phase 7: Finalization and data export

Goal: Complete collection and export data to Excel

Action: After completing each page, click the "Next Page" button to move to the next set of products. Once all pages have been collected, use "Export Data table" to export the data table as an Excel file.

Now your workflow is done!

Tips about parameters and configuration

The app’s input parameters include three key items: pageNumber (a required number field, used to set the total number of pages to scrape), productName (a required field, your target search keyword), and File_path (an optional field, the save path for the final Excel file).

This is the "App Input Parameter Settings" interface—its core function is to configure required parameters before running the application. Once set, these parameters automatically sync to the corresponding variables in the workflow, directly driving subsequent collection operations (no need to manually modify variables inside the process).
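As a rough mental model, the three input parameters behave like a small record that is checked before the run starts. The Python sketch below is not how Octoparse AI stores parameters; the AppInput class and its validation rules (for example, requiring at least one page) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AppInput:
    pageNumber: int                   # required: pages to collect
    productName: str                  # required: search keyword
    File_path: Optional[str] = None   # optional: where to save the Excel file

    def __post_init__(self):
        # Assumed rule: at least one page must be requested.
        if not isinstance(self.pageNumber, int) or self.pageNumber < 1:
            raise ValueError("pageNumber must be a whole number of at least 1")
        if not self.productName.strip():
            raise ValueError("productName is required")


params = AppInput(pageNumber=3, productName="desk lamp")
print(params)
```

Validating inputs up front means a bad value is reported immediately instead of surfacing halfway through a collection run.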

Learn more about automation basics: Variables & Parameters

After capturing related elements in a single loop execution, you first need to extract actual data from the target attributes of the web element object (such as text, numbers), then call the Write to row function to store the data into the Data table.

A critical point: The column names corresponding to the data to be written must exactly match the column names of the preset data table variables (custom variables), and the extracted data must match the field type defined by the column name.
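The exact-match rule for column names has a close analogue in Python's csv.DictWriter, which by default rejects a row containing a key that is not a declared column. The column names below are hypothetical, and how Octoparse AI itself reports a mismatch may differ.

```python
import csv
import io

# Hypothetical column names for the data table.
COLUMNS = ["name", "price", "rating"]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()

# Keys match the declared columns exactly, so the row is written.
writer.writerow({"name": "Desk Lamp", "price": "19.99", "rating": "4.5"})

# "Name" differs from "name" by case, so the row is rejected.
try:
    writer.writerow({"Name": "USB Hub", "price": "9.99", "rating": "4.1"})
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)

print(buffer.getvalue())
print(mismatch_error)
```

Even a difference in capitalization counts as a mismatch here, which is a good reason to copy column names rather than retype them.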

Frequently Asked Questions and Troubleshooting

Q1: The process gets stuck due to page loading timeout.

Problem: The process gets stuck while waiting for the page to load and shows a timeout prompt.
Cause: Network fluctuations or an excessively short timeout period.
Solution: Check your network connection, set the loading timeout period to 30-60 seconds, and enable refresh upon timeout.
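The idea behind "refresh upon timeout" is a bounded retry. Here is a minimal Python sketch of that pattern; load_with_retry and flaky_load are hypothetical helpers, and the three-attempt limit is an assumption rather than an Octoparse AI setting.

```python
def load_with_retry(load_page, timeout_seconds=30, max_attempts=3):
    """Call load_page(timeout_seconds); on TimeoutError, try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return load_page(timeout_seconds)
        except TimeoutError:
            if attempt == max_attempts:
                raise  # give up after the last attempt


# Hypothetical loader that times out twice, then succeeds.
attempts = []


def flaky_load(timeout_seconds):
    attempts.append(timeout_seconds)
    if len(attempts) < 3:
        raise TimeoutError("page did not load in time")
    return "list page loaded"


result = load_with_retry(flaky_load, timeout_seconds=45)
print(result, "after", len(attempts), "attempts")
```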

Q2: "Element not found" when locating products or sub-elements.

Problem: When traversing products and locating sub-elements, the prompt "Element not found" appears.
Cause: Amazon fine-tuned its page layout, which invalidated the parent element association or changed the relative XPath.
Solution: First check whether the parent element of the product card is still valid, then verify whether your relative XPath has changed. You can also scroll the page smoothly and wait for products to fully load before attempting to locate elements.

Q3: Excel data export failed

Problem: The export instruction prompts "Failed to write file".
Cause: The target file is open in another program, the save path lacks write permission, or the path contains special characters.
Solution: Close any open copies of the Excel file first. Choose the Desktop or Documents folder as the save path, and use only letters and numbers in the path name (no special characters).

Summary

In this tutorial, you've connected the dots from custom inputs—like pages, keywords, and paths—to a full collection workflow. You looped through pages and elements, used relative XPath for reliable positioning, and handled exceptions with 'If' statements. Remember, always extract data from element attributes first, then match them to your table columns for seamless storage. Finally, exporting to Excel closes the loop. Now, apply this to your own projects—you’ll see how it transforms your rule design into efficient data gathering.
