🕸️ How to Scrape Indonesian Public Company Profiles

Published at
12/12/2024
Categories
webscraping
tutorial
python
beginners
Author
Raka Widhi Antoro
🌐 Understanding the Challenge of Financial Data Accessibility

Structured financial information about public companies can be hard to obtain, especially in emerging markets like Indonesia. The Indonesia Stock Exchange (IDX) publishes comprehensive company profiles, but the data is often:

  • 🧩 Scattered across multiple web pages
  • 🌍 Primarily in the Indonesian language
  • 📊 Not readily available in machine-readable formats

🚀 The Need for Automated Data Collection

Financial analysts, researchers, and investors frequently encounter barriers when trying to:

  • 📈 Compile comprehensive company information
  • 🔄 Translate and standardize company data
  • 📝 Create datasets for market research or investment analysis

🤖 Theoretical Approach to Web Scraping

Web scraping extracts structured data from websites programmatically. Our approach rests on three principles:

  1. 🤲 Automated Data Extraction

    • Eliminate manual data entry
    • Reduce human error
    • Enable rapid, repeatable data collection
  2. 🌐 Dynamic Web Interaction

    • Use Selenium WebDriver to simulate human-like browser interactions
    • Handle dynamic content loading
    • Navigate complex web structures
  3. 🌈 Data Translation and Standardization

    • Convert Indonesian field names to English
    • Create a consistent, machine-readable data format
    • Improve data interoperability

🧩 Technical Challenges and Solutions

Challenge: Multilingual Data Extraction

  • 🌍 Problem: Company information is primarily in Indonesian
  • 🔍 Solution: Implement a translation mapping for key terms

Challenge: Dynamic Web Content

  • ⚡ Problem: Websites use JavaScript to load content
  • ⏳ Solution: Use WebDriverWait to pause until the target element is present before extracting data
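
To make the idea of an explicit wait concrete, here is a minimal pure-Python sketch of the poll-until-ready pattern. It is not Selenium's implementation: wait_until, its parameters, and the WaitTimeout exception are hypothetical names used only for illustration. WebDriverWait works along similar lines, calling a condition repeatedly until it returns a truthy value or the timeout expires.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never becomes truthy (hypothetical name)."""


def wait_until(condition, timeout=10.0, poll_interval=0.5,
               clock=time.monotonic, sleep=time.sleep):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds pass.
    `clock` and `sleep` are injectable so the helper can be tested quickly.
    """
    deadline = clock() + timeout
    while True:
        result = condition()
        if result:
            return result
        if clock() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} seconds")
        sleep(poll_interval)


if __name__ == "__main__":
    # Simulate content that only "appears" on the third check
    attempts = {"count": 0}

    def content_loaded():
        attempts["count"] += 1
        return "table ready" if attempts["count"] >= 3 else None

    print(wait_until(content_loaded, timeout=5, poll_interval=0.1))
```

Compared with a fixed time.sleep(), this pattern returns as soon as the content is ready and fails loudly when it never shows up.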

Challenge: Robust Error Handling

  • 🛡️ Problem: Inconsistent web page structures
  • 🔧 Solution: Implement flexible data extraction with fallback mechanisms
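
A fallback chain of this kind can be sketched in plain Python. Here first_successful and the sample extractors are hypothetical helpers, not part of Selenium; in a real scraper each extractor would wrap a different element lookup.

```python
def first_successful(extractors, default=""):
    """Try each extractor in order and return the first non-empty result.

    An extractor is any zero-argument callable. One that raises or returns
    an empty value is skipped, so a missing element does not stop the run.
    """
    for extract in extractors:
        try:
            value = extract()
        except Exception:
            continue  # This strategy failed; move on to the next one
        if value:
            return value
    return default


if __name__ == "__main__":
    cell = {"span": "Jakarta"}  # Stand-in for a table cell with no link

    value = first_successful([
        lambda: cell["a"],     # Raises KeyError: the cell has no link
        lambda: cell["span"],  # Succeeds
    ])
    print(value)
```

Listing the strategies in priority order keeps the extraction logic flat and makes it easy to add another fallback when the page layout changes.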

🛠️ Implementation Strategy

Our Python script will:

  • Use Selenium WebDriver for web automation
  • Extract company profile data
  • Translate field names
  • Save data in a standardized JSON format

📝 Python Code Implementation

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import json

def translate_key(key):
    """
    Translate key from Indonesian to English using a predefined dictionary.

    Args:
    - key (str): The key in Indonesian to be translated.

    Returns:
    - str: The translated key in English, or the original key if no translation is found.
    """
    # Dictionary to translate key from Indonesian to English
    translations = {
        "Nama": "name",
        "Kode": "code",
        "Alamat Kantor": "office_address",
        "Alamat Email": "email",
        "Telepon": "phone",
        "Fax": "fax",
        "NPWP": "tax_id",
        "Situs": "website",
        "Tanggal Pencatatan": "listing_date",
        "Papan Pencatatan": "board",
        "Bidang Usaha Utama": "main_business",
        "Sektor": "sector",
        "Subsektor": "subsector",
        "Industri": "industry",
        "Subindustri": "subindustry",
        "Biro Administrasi Efek": "share_registrar"
    }
    return translations.get(key, key)  # Return original key if translation not found

def scrape_idx_profile(code_stock):
    """
    Scrape company profile data from IDX website.

    Args:
    - code_stock (str): Stock code (e.g., 'BBCA') to scrape data for.

    Returns:
    - dict: A dictionary containing the scraped company data in English.
    """
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    chrome_options.add_argument("--window-size=1920,1080")
    chrome_options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36")

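    # Start the browser before entering the try block so that `driver` is
    # always defined by the time the finally clause calls driver.quit()
    driver = webdriver.Chrome(options=chrome_options)
    try: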
        url = f"https://www.idx.co.id/id/perusahaan-tercatat/profil-perusahaan-tercatat/{code_stock}"
        print(f"Accessing URL: {url}")

        driver.get(url)
        # Fixed pause to give client-side rendering a head start before the explicit wait below
        time.sleep(5)

        # Dictionary to store data
        company_data = {}

        try:
            # Wait for element with class 'bzg' to appear
            wait = WebDriverWait(driver, 10)
            bzg_element = wait.until(EC.presence_of_element_located((By.CLASS_NAME, "bzg")))

            # Get all tables within 'bzg'
            tables = bzg_element.find_elements(By.TAG_NAME, "table")

            # Process each table
            for table in tables:
                rows = table.find_elements(By.TAG_NAME, "tr")
                for row in rows:
                    try:
                        # Get the field name (td with class td-name)
                        field_name = row.find_element(By.CLASS_NAME, "td-name").text.strip()

                        # Get the content (td with class td-content)
                        content_element = row.find_element(By.CLASS_NAME, "td-content")

                        # Prefer the link text when the cell contains an <a>;
                        # otherwise fall back to the <span> text
                        links = content_element.find_elements(By.TAG_NAME, "a")
                        if links:
                            content = links[0].text.strip()
                        else:
                            content = content_element.find_element(By.TAG_NAME, "span").text.strip()

                        # Translate key to English and store in dictionary
                        english_key = translate_key(field_name)
                        company_data[english_key] = content

                    except Exception:
                        # Skip rows that do not yield both a td-name and a td-content cell
                        continue

        except Exception as e:
            print(f"Error processing bzg element: {str(e)}")

        # Save data to JSON file
        with open(f'data_{code_stock}.json', 'w', encoding='utf-8') as f:
            json.dump(company_data, f, ensure_ascii=False, indent=4)

        print(f"\nData successfully saved to data_{code_stock}.json")

        # Print data preview
        print("\nScraped data preview:")
        for key, value in company_data.items():
            print(f"{key}: {value}")

        return company_data

    except Exception as e:
        print(f"An error occurred: {str(e)}")
        return None

    finally:
        driver.quit()

if __name__ == "__main__":
    code_stock = "BBCA"  # Can be changed to other stock codes
    result = scrape_idx_profile(code_stock)

📚 Code Walkthrough and Design Patterns

1. Translation Mechanism

The translate_key() function demonstrates a dictionary-based translation approach:

  • Maps Indonesian financial terms to English
  • Provides a fallback for unmapped terms
  • Ensures consistent terminology across extracted data
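
As a small extension of this idea, the sketch below applies a mapping to a whole record and reports which keys had no translation, which helps spot fields that are missing from the dictionary. Both translate_record and the shortened TRANSLATIONS table are hypothetical, and the sample values are placeholders rather than real IDX data.

```python
# A shortened copy of the mapping, for illustration only
TRANSLATIONS = {
    "Nama": "name",
    "Kode": "code",
    "Sektor": "sector",
}


def translate_record(record, translations=TRANSLATIONS):
    """Return (translated_record, unmapped_keys) for a scraped record.

    Keys are stripped of surrounding whitespace before lookup. Keys without
    a translation are kept as-is and also listed in unmapped_keys.
    """
    translated = {}
    unmapped = []
    for key, value in record.items():
        clean_key = key.strip()
        if clean_key in translations:
            translated[translations[clean_key]] = value
        else:
            translated[clean_key] = value
            unmapped.append(clean_key)
    return translated, unmapped


if __name__ == "__main__":
    raw = {"Nama": "Example Company Tbk", " Kode ": "XXXX", "Logo": "n/a"}
    record, missing = translate_record(raw)
    print(record)
    print("Unmapped keys:", missing)
```

Logging the unmapped keys during a run is a cheap way to notice when the site adds or renames a field.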

2. Robust Web Scraping

The scrape_idx_profile() function implements several resilience strategies:

  • Headless browser configuration
  • Explicit waits for page elements
  • Flexible content extraction
  • Comprehensive error handling

3. Data Standardization

  • Converts multilingual data to a uniform format
  • Generates machine-readable JSON output
  • Leaves field values unchanged; only the keys are translated
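
Because each profile is a flat dictionary, several of them can be combined into a single table. The sketch below serializes a list of such dictionaries to CSV text using only the standard library. The helper profiles_to_csv is hypothetical, and the sample rows are placeholders, not scraped values.

```python
import csv
import io


def profiles_to_csv(profiles):
    """Serialize a list of flat profile dicts to CSV text.

    The header is the union of all keys in first-seen order, so a profile
    that lacks a field simply gets an empty cell in that column.
    """
    fieldnames = []
    for profile in profiles:
        for key in profile:
            if key not in fieldnames:
                fieldnames.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(profiles)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = [
        {"code": "AAAA", "name": "Example One Tbk", "sector": "Financials"},
        {"code": "BBBB", "name": "Example Two Tbk"},  # No sector scraped
    ]
    print(profiles_to_csv(sample))
```

Using the translated English keys as column names is what makes profiles from different companies line up in one table.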

🚀 Practical Applications

This script can be used for:

  • 📊 Financial research
  • 🔍 Market analysis
  • 💼 Investment due diligence
  • 🎓 Academic research on Indonesian public companies

⚖️ Ethical Considerations and Limitations

🤝 Responsible Scraping

  • Respect website terms of service
  • Implement rate limiting
  • Use scraping ethically and legally
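
One simple way to rate-limit is to pause between requests when scraping several stock codes. In this sketch, scrape_many and its delay_seconds parameter are hypothetical, and scrape stands for any function that takes a stock code, such as scrape_idx_profile(). The 3-second default is an arbitrary placeholder, not a limit published by IDX.

```python
import time


def scrape_many(codes, scrape, delay_seconds=3.0, sleep=time.sleep):
    """Scrape each code in turn, pausing between consecutive requests.

    Returns a dict mapping each code to whatever `scrape(code)` returned.
    `sleep` is injectable so the pacing can be tested without waiting.
    """
    results = {}
    for index, code in enumerate(codes):
        if index > 0:
            sleep(delay_seconds)  # Pause before every request except the first
        results[code] = scrape(code)
    return results


if __name__ == "__main__":
    # A stand-in scraper so the example runs without a browser
    def fake_scrape(code):
        return {"code": code}

    print(scrape_many(["AAAA", "BBBB"], fake_scrape, delay_seconds=0.1))
```

Passing the scraper in as an argument keeps the pacing logic independent of Selenium, so the delay can be tuned or tested on its own.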

⚠️ Disclaimer

This tool is for educational purposes. Always verify data accuracy and comply with legal and ethical guidelines when scraping web content.
