Himalayas Job & Company Scraper

Extract structured remote job listings and company profiles from Himalayas search pages and company libraries in one consistent dataset. Built for teams that need reliable Himalayas job data for aggregation, enrichment, and hiring intelligence without manual copying.

Created by Bitbash, built to showcase our approach to Scraping and Automation!
If you are looking for himalayas-job-company-scraper, you've just found your team. Let's chat.

Introduction

This project collects remote job listings and company profiles from Himalayas using a single search URL input for either Jobs or Companies. It helps turn messy browsing into structured, reusable data for recruitment workflows, market research, and enrichment pipelines. It’s designed for job boards, HR tech teams, analysts, and automation builders who need repeatable extraction with clean fields.

Built for remote hiring pipelines

  • Accepts a single search URL for job results or company directories
  • Extracts job + company details into normalized objects (easy to store and dedupe)
  • Captures salary ranges, tech stacks, benefits, and social profiles when available
  • Supports frequent re-runs to track updates, expirations, and newly posted roles
  • Outputs both human-readable text fields and rich HTML where useful

Features

| Feature | Description |
| --- | --- |
| Job search URL support | Pull job listings from any supported search results page with filters already applied. |
| Company directory URL support | Extract company profiles from company library pages or filtered directories. |
| Salary range capture | Collects min/max salary and currency when published. |
| Tech stack extraction | Reads technologies, tools, and stack categories for each company when listed. |
| Benefits cataloging | Extracts company benefits with category and description for benchmarking. |
| Social & website enrichment | Collects website and social account links for outreach and enrichment. |
| Job metadata normalization | Captures created/updated timestamps, expiration, categories, skills, and restrictions. |
| Promoted job detection | Flags stickied/promoted listings so they can be filtered or labeled in downstream tools. |
| Canonical URL + GUID fields | Stores stable identifiers to prevent duplicates and support incremental syncs. |
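
As an illustration of promoted job detection downstream, the sketch below partitions records on the isStickied flag. It assumes records are loaded as Python dicts shaped like the example output; the function name split_promoted is hypothetical and not part of this project.

```python
from typing import Iterable


def split_promoted(jobs: Iterable[dict]) -> tuple[list[dict], list[dict]]:
    """Partition job records into (promoted, organic) using isStickied."""
    promoted: list[dict] = []
    organic: list[dict] = []
    for job in jobs:
        # A missing or null flag is treated as "not promoted" (assumption).
        (promoted if job.get("isStickied") else organic).append(job)
    return promoted, organic


jobs = [
    {"slug": "a", "isStickied": True},
    {"slug": "b", "isStickied": False},
    {"slug": "c"},
]
promoted, organic = split_promoted(jobs)
print([j["slug"] for j in promoted], [j["slug"] for j in organic])
```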

What Data This Scraper Extracts

| Field Name | Field Description |
| --- | --- |
| slug | URL-safe identifier for the job or company. |
| title | Job title (for jobs) or company name (for companies). |
| employmentType | Job type such as full-time, contract, etc. |
| minSalary | Minimum salary value when available. |
| maxSalary | Maximum salary value when available. |
| currency | Salary currency code (e.g., USD). |
| applicationLink | Direct application URL when present. |
| locationRestrictions | Allowed/limited locations listed on the job. |
| timezoneRestrictions | Allowed/limited timezones (often numeric offsets, e.g., -8). |
| createdAt | Original posting time when available. |
| updatedAt | Last update timestamp (job edits, listing updates). |
| expiryDate | Expiration date when provided. |
| isStickied | Whether the job is promoted/stickied. |
| parentCategories | Higher-level category grouping for the role. |
| categories | Job categories/tags as listed. |
| skills | Skills/technologies required or recommended. |
| guid | Canonical URL identifier for stable reference. |
| description_html | Rich job description HTML. |
| description_text | Cleaned plain-text job description. |
| company.name | Employer name associated with the job. |
| company.slug | URL-safe identifier for the company. |
| company.employeeRange | Employee count band (e.g., 1-10, 1001-5000). |
| company.summary | Short company summary/one-liner. |
| company.about | Long-form company description (HTML). |
| company.externalLink | Company website URL when available. |
| company.internalLink | Company profile URL. |
| company.logo | Company logo URL. |
| company.yearFounded | Year founded when listed. |
| company.ceo | CEO name when listed. |
| company.locations | Country/region objects where the company operates. |
| company.markets | Markets/industries tags for the company. |
| company.isVerified | Whether the company profile is verified. |
| company.liveJobsCount | Number of active job listings for the company. |
| company.liveJobSlugs | Slugs for active jobs tied to the company. |
| company.benefits | Benefits array with title, description, category. |
| company.stacks | Tech stack array with title, summary, logo, category. |
| company.twitter | Twitter/X profile URL if present. |
| company.linkedin | LinkedIn profile URL if present. |
| company.facebook | Facebook profile URL if present. |
| company.instagram | Instagram profile URL if present. |
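
The description_html and description_text fields imply an HTML-to-text step. The project's own html_to_text.py is not shown in this README, so the following is only a sketch of one possible approach using the standard library's html.parser; the class name, function name, and the set of block-level tags are assumptions.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    # Tags that should start a new line in the plain-text output (assumed set).
    BLOCK_TAGS = {"p", "div", "br", "li", "ul", "ol", "h1", "h2", "h3", "h4"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html: str) -> str:
    """Strip tags, decode entities, and collapse whitespace line by line."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


print(html_to_text("<p>Hello <b>world</b></p><ul><li>One</li><li>Two &amp; three</li></ul>"))
```

This sketch does not skip script or style content, which a production cleaner would likely need to handle.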

Example Output

[
  {
    "slug": "remote-administrative-assistant",
    "title": "Remote - Administrative Assistant",
    "employmentType": "Full Time",
    "minSalary": 20000,
    "maxSalary": 25000,
    "currency": "USD",
    "applicationLink": "https://himalayas.app/apply/fgpvm",
    "locationRestrictions": [],
    "timezoneRestrictions": [ -8, -7, -6, -5, -4 ],
    "createdAt": "2024-10-16 12:53:19",
    "updatedAt": "2024-10-28 07:30:07",
    "expiryDate": "2024-11-15 12:50:41",
    "isStickied": true,
    "parentCategories": [ "Human Resources" ],
    "categories": [
      "Remote-Administrative-Assistant",
      "Administrative-Assistant",
      "Virtual-Assistant",
      "Executive-Assistant"
    ],
    "skills": [
      "Administrative-Support",
      "Project-Management",
      "MS-Office-Suite",
      "Remote-Collaboration"
    ],
    "guid": "https://himalayas.app/companies/infrasync-technology-services/jobs/remote-administrative-assistant",
    "company": {
      "name": "Infrasync Technology Services",
      "slug": "infrasync-technology-services",
      "employeeRange": "1-10",
      "isVerified": true,
      "logo": "https://cdn-images.himalayas.app/l6y9d2uqx7o85917rznbk97ucczm",
      "internalLink": "https://himalayas.app/companies/infrasync-technology-services",
      "externalLink": "https://infrasync.com?utm_source=himalayas.app&utm_medium=himalayas.app&utm_campaign=himalayas.app&ref=himalayas.app&source=himalayas.app",
      "yearFounded": 2024,
      "ceo": "Andrew Swirsky",
      "liveJobsCount": 1,
      "liveJobSlugs": [ "remote-administrative-assistant" ],
      "linkedin": "https://www.linkedin.com/company/98777116"
    }
  }
]
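
The timestamps in the example use the form YYYY-MM-DD HH:MM:SS with no timezone marker. A consumer might parse them as sketched below. Treating the values as UTC is an assumption (the output does not state a timezone), and parse_ts and is_expired are hypothetical helper names.

```python
from datetime import datetime, timezone
from typing import Optional

TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_ts(value: Optional[str]) -> Optional[datetime]:
    """Parse a scraper timestamp; returns None for missing values."""
    if not value:
        return None
    # Assumption: timestamps are UTC. Adjust if your data says otherwise.
    return datetime.strptime(value, TS_FORMAT).replace(tzinfo=timezone.utc)


def is_expired(job: dict, now: Optional[datetime] = None) -> bool:
    """True when expiryDate is present and not later than `now`."""
    expiry = parse_ts(job.get("expiryDate"))
    if expiry is None:
        return False  # no expiry published: treat as still live
    now = now or datetime.now(timezone.utc)
    return expiry <= now


job = {"expiryDate": "2024-11-15 12:50:41"}
print(is_expired(job, now=datetime(2024, 11, 1, tzinfo=timezone.utc)))
print(is_expired(job, now=datetime(2024, 12, 1, tzinfo=timezone.utc)))
```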

Directory Structure Tree

Himalayas Job & Company Scraper/
├── src/
│   ├── main.py
│   ├── runner.py
│   ├── cli.py
│   ├── config/
│   │   ├── settings.py
│   │   └── logging.yaml
│   ├── core/
│   │   ├── browser.py
│   │   ├── routes.py
│   │   ├── validators.py
│   │   └── retry.py
│   ├── extractors/
│   │   ├── jobs_extractor.py
│   │   ├── companies_extractor.py
│   │   ├── job_detail_parser.py
│   │   ├── company_detail_parser.py
│   │   └── html_to_text.py
│   ├── models/
│   │   ├── job.py
│   │   ├── company.py
│   │   └── common.py
│   ├── normalization/
│   │   ├── salary.py
│   │   ├── tags.py
│   │   ├── dates.py
│   │   └── dedupe.py
│   ├── outputs/
│   │   ├── dataset_writer.py
│   │   ├── jsonl_exporter.py
│   │   └── csv_exporter.py
│   └── utils/
│       ├── urls.py
│       ├── hashing.py
│       └── timers.py
├── tests/
│   ├── test_jobs_parser.py
│   ├── test_company_parser.py
│   └── test_normalization.py
├── data/
│   ├── input.sample.json
│   └── output.sample.json
├── .env.example
├── requirements.txt
├── pyproject.toml
├── README.md
└── LICENSE

Use Cases

  • Job board operators use it to collect Himalayas job data at scale, so they can publish searchable listings with consistent fields and fewer duplicates.
  • Recruitment agencies use it to extract company profiles and open roles, so they can enrich leads and speed up outreach.
  • HR tech teams use it to feed ATS/CRM pipelines, so they can automate sourcing and keep listings fresh with scheduled re-runs.
  • Market analysts use it to track salaries, skills, and hiring trends, so they can compare demand across roles, regions, and timezones.
  • SaaS enrichment workflows use it to capture websites and social profiles, so they can build better firmographic datasets for sales ops.

FAQs

Q1) What input should I provide to start extracting data? Provide a single Himalayas search URL for either job search results or the company directory. Use the website filters (role, category, location, timezone) first, then paste the filtered URL so the extractor mirrors your selection.
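
Before starting a run, it can help to check which kind of search URL was pasted. The sketch below is an illustration only: the /jobs and /companies path prefixes are assumptions about the site's URL layout (only the /companies profile path appears in the example output), and classify_search_url is a hypothetical helper, not part of this project.

```python
from urllib.parse import urlparse


def classify_search_url(url: str) -> str:
    """Return "jobs" or "companies" for a Himalayas search URL."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host != "himalayas.app" and not host.endswith(".himalayas.app"):
        raise ValueError(f"Not a Himalayas URL: {url!r}")
    segments = [s for s in parsed.path.split("/") if s]
    # Assumed layout: /jobs... for job search, /companies... for the directory.
    if segments and segments[0] == "jobs":
        return "jobs"
    if segments and segments[0] == "companies":
        return "companies"
    raise ValueError(f"Unrecognized search URL path: {parsed.path!r}")


print(classify_search_url("https://himalayas.app/jobs?q=python"))
print(classify_search_url("https://himalayas.app/companies"))
```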

Q2) Why are some jobs missing salary or some companies missing benefits/tech stacks? Not every listing publishes salary ranges, and some company profiles are incomplete. The extractor returns null/empty values when fields are not available so downstream systems can handle partial enrichment safely.
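
Downstream code can handle partial records by checking for missing values before computing anything. A minimal sketch, assuming records are dicts shaped like the example output; salary_midpoint is a hypothetical helper name.

```python
from typing import Optional


def salary_midpoint(job: dict) -> Optional[float]:
    """Midpoint of the salary range, tolerating missing bounds."""
    low, high = job.get("minSalary"), job.get("maxSalary")
    if low is None and high is None:
        return None  # no salary published
    if low is None:
        return float(high)
    if high is None:
        return float(low)
    return (low + high) / 2


print(salary_midpoint({"minSalary": 20000, "maxSalary": 25000}))
print(salary_midpoint({"minSalary": None, "maxSalary": None}))
```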

Q3) How do I avoid duplicates when I run this frequently? Use stable identifiers as primary keys: guid for jobs (slug alone may not be unique across companies) and company.slug for companies. Perform upserts instead of inserts. For job updates, use updatedAt to detect changes and expiryDate to remove expired roles.
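
One way to apply this is an upsert keyed on guid. The sketch below uses SQLite from the standard library (the ON CONFLICT clause needs SQLite 3.24 or newer). The table layout and the function names upsert_jobs and purge_expired are assumptions for illustration, not part of this project. It relies on the YYYY-MM-DD HH:MM:SS timestamps sorting correctly as plain strings.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    guid        TEXT PRIMARY KEY,
    updated_at  TEXT,
    expiry_date TEXT,
    payload     TEXT NOT NULL
)
"""

# Only overwrite a stored row when the incoming record is newer.
UPSERT = """
INSERT INTO jobs (guid, updated_at, expiry_date, payload)
VALUES (:guid, :updated_at, :expiry_date, :payload)
ON CONFLICT(guid) DO UPDATE SET
    updated_at  = excluded.updated_at,
    expiry_date = excluded.expiry_date,
    payload     = excluded.payload
WHERE jobs.updated_at IS NULL OR excluded.updated_at > jobs.updated_at
"""


def upsert_jobs(conn: sqlite3.Connection, jobs: list) -> None:
    conn.execute(SCHEMA)
    rows = [
        {
            "guid": job["guid"],
            "updated_at": job.get("updatedAt"),
            "expiry_date": job.get("expiryDate"),
            "payload": json.dumps(job),
        }
        for job in jobs
        if job.get("guid")  # records without a guid cannot be keyed
    ]
    conn.executemany(UPSERT, rows)
    conn.commit()


def purge_expired(conn: sqlite3.Connection, now: str) -> int:
    """Delete rows whose expiry_date is at or before `now`; returns count."""
    cur = conn.execute(
        "DELETE FROM jobs WHERE expiry_date IS NOT NULL AND expiry_date <= ?",
        (now,),
    )
    conn.commit()
    return cur.rowcount


conn = sqlite3.connect(":memory:")
job = {
    "guid": "https://himalayas.app/companies/infrasync-technology-services/jobs/remote-administrative-assistant",
    "title": "Remote - Administrative Assistant",
    "updatedAt": "2024-10-28 07:30:07",
    "expiryDate": "2024-11-15 12:50:41",
}
upsert_jobs(conn, [job])
upsert_jobs(conn, [{**job, "updatedAt": "2024-10-20 00:00:00", "title": "stale"}])
print(conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0])
```

The same pattern would apply to a companies table keyed on company.slug.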

Q4) Can I extract only companies or only jobs? Yes. Use a company directory/search URL to focus on companies, or a jobs search URL to focus on jobs. If your workflow needs both, run two inputs and join on company.slug or company.internalLink.
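
The join described above could look like the sketch below. It assumes company-run records carry a top-level slug (as the field table suggests) and that job records nest the employer under company. The companyProfile key and the attach_companies name are hypothetical.

```python
def attach_companies(jobs: list, companies: list) -> list:
    """Attach the matching company record to each job via company.slug."""
    by_slug = {c["slug"]: c for c in companies if c.get("slug")}
    joined = []
    for job in jobs:
        slug = (job.get("company") or {}).get("slug")
        # companyProfile is None when no company record matches.
        joined.append({**job, "companyProfile": by_slug.get(slug)})
    return joined


jobs = [
    {"slug": "remote-administrative-assistant",
     "company": {"slug": "infrasync-technology-services"}},
    {"slug": "orphan-job", "company": None},
]
companies = [{"slug": "infrasync-technology-services", "title": "Infrasync Technology Services"}]
for row in attach_companies(jobs, companies):
    print(row["slug"], row["companyProfile"] is not None)
```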


Performance Benchmarks and Results

Primary Metric: ~35–70 listings/min on typical search pages, depending on filters and the number of detail pages required per item.

Reliability Metric: 96–99% successful item completion on stable runs, with automatic retries handling intermittent navigation and network blips.
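
The retry behavior mentioned here is not documented in detail in this README. As a general illustration of the technique only, the following sketch retries a callable with exponential backoff and jitter; the function name, defaults, and the choice of exception types are all assumptions and do not describe the project's retry.py.

```python
import random
import time


def with_retries(func, attempts=3, base_delay=0.5,
                 retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func(); on a retryable error, wait and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Exponential backoff: base, 2x base, 4x base, ... plus jitter.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))


calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network blip")
    return "ok"


# sleep is stubbed out here so the demo finishes instantly.
print(with_retries(flaky_fetch, sleep=lambda _: None), calls["n"])
```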

Efficiency Metric: ~250–450 MB RAM average during active extraction, with throughput scaling mainly by concurrent page processing and the number of detail pages visited.

Quality Metric: 85–95% field completeness for core fields (slug, title, URLs, timestamps), while optional enrichment fields (salary, benefits, stacks, socials) vary based on profile completeness.
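
To measure field completeness on your own runs, something like the sketch below can be used. The exact list of core fields and the definition of "complete" (not null and not empty) are assumptions; field_completeness is a hypothetical helper.

```python
CORE_FIELDS = ("slug", "title", "guid", "createdAt", "updatedAt")
EMPTY_VALUES = (None, "", [], {})


def field_completeness(records, fields=CORE_FIELDS):
    """Fraction of records with a non-empty value, per field."""
    records = list(records)
    if not records:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for r in records if r.get(field) not in EMPTY_VALUES) / len(records)
        for field in fields
    }


records = [
    {"slug": "a", "title": "A", "guid": "g1",
     "createdAt": "2024-10-16 12:53:19", "updatedAt": None},
    {"slug": "b", "title": "", "guid": "g2"},
]
print(field_completeness(records))
```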

Review 1

"Bitbash is a top-tier automation partner, innovative, reliable, and dedicated to delivering real results every time."

Nathan Pennington
Marketer
★★★★★

Review 2

"Bitbash delivers outstanding quality, speed, and professionalism, truly a team you can rely on."

Eliza
SEO Affiliate Expert
★★★★★

Review 3

"Exceptional results, clear communication, and flawless delivery. Bitbash nailed it."

Syed
Digital Strategist
★★★★★