Extract structured remote job listings and company profiles from Himalayas search pages and company libraries in one consistent dataset. Built for teams that need reliable Himalayas job data for aggregation, enrichment, and hiring intelligence without manual copying.
Created by Bitbash, built to showcase our approach to Scraping and Automation!
If you're looking for a himalayas-job-company-scraper, you've just found your team. Let's chat!
This project collects remote job listings and company profiles from Himalayas using a single search URL input for either Jobs or Companies. It helps turn messy browsing into structured, reusable data for recruitment workflows, market research, and enrichment pipelines. It’s designed for job boards, HR tech teams, analysts, and automation builders who need repeatable extraction with clean fields.
- Accepts a single search URL for job results or company directories
- Extracts job + company details into normalized objects (easy to store and dedupe)
- Captures salary ranges, tech stacks, benefits, and social profiles when available
- Supports frequent re-runs to track updates, expirations, and newly posted roles
- Outputs both human-readable text fields and rich HTML where useful
| Feature | Description |
|---|---|
| Job search URL support | Pull job listings from any supported search results page with filters already applied. |
| Company directory URL support | Extract company profiles from company library pages or filtered directories. |
| Salary range capture | Collects min/max salary and currency when published. |
| Tech stack extraction | Reads technologies, tools, and stack categories for each company when listed. |
| Benefits cataloging | Extracts company benefits with category and description for benchmarking. |
| Social & website enrichment | Collects website and social account links for outreach and enrichment. |
| Job metadata normalization | Captures created/updated timestamps, expiration, categories, skills, and restrictions. |
| Promoted job detection | Flags stickied/promoted listings so they can be filtered or labeled in downstream tools. |
| Canonical URL + GUID fields | Stores stable identifiers to prevent duplicates and support incremental syncs. |
| Field Name | Field Description |
|---|---|
| slug | URL-safe identifier for the job or company. |
| title | Job title (for jobs) or company name (for companies). |
| employmentType | Job type such as full-time, contract, etc. |
| minSalary | Minimum salary value when available. |
| maxSalary | Maximum salary value when available. |
| currency | Salary currency code (e.g., USD). |
| applicationLink | Direct application URL when present. |
| locationRestrictions | Allowed/limited locations listed on the job. |
| timezoneRestrictions | Allowed/limited timezones (often numeric offsets). |
| createdAt | Original posting time when available. |
| updatedAt | Last update timestamp (job edits, listing updates). |
| expiryDate | Expiration date when provided. |
| isStickied | Whether the job is promoted/stickied. |
| parentCategories | Higher-level category grouping for the role. |
| categories | Job categories/tags as listed. |
| skills | Skills/technologies required or recommended. |
| guid | Canonical URL identifier for stable reference. |
| description_html | Rich job description HTML. |
| description_text | Cleaned plain-text job description. |
| company.name | Employer name associated with the job. |
| company.employeeRange | Employee count band (e.g., 1-10, 1001-5000). |
| company.summary | Short company summary/one-liner. |
| company.about | Long-form company description (HTML). |
| company.externalLink | Company website URL when available. |
| company.internalLink | Company profile URL. |
| company.logo | Company logo URL. |
| company.yearFounded | Year founded when listed. |
| company.ceo | CEO name when listed. |
| company.locations | Country/region objects where the company operates. |
| company.markets | Markets/industries tags for the company. |
| company.isVerified | Whether the company profile is verified. |
| company.liveJobsCount | Number of active job listings for the company. |
| company.liveJobSlugs | Slugs for active jobs tied to the company. |
| company.benefits | Benefits array with title, description, category. |
| company.stacks | Tech stack array with title, summary, logo, category. |
| company.twitter | Twitter/X profile URL if present. |
| company.linkedin | LinkedIn profile URL if present. |
| company.facebook | Facebook profile URL if present. |
| company.instagram | Instagram profile URL if present. |
[
{
"slug": "remote-administrative-assistant",
"title": "Remote - Administrative Assistant",
"employmentType": "Full Time",
"minSalary": 20000,
"maxSalary": 25000,
"currency": "USD",
"applicationLink": "https://himalayas.app/apply/fgpvm",
"locationRestrictions": [],
"timezoneRestrictions": [ -8, -7, -6, -5, -4 ],
"createdAt": "2024-10-16 12:53:19",
"updatedAt": "2024-10-28 07:30:07",
"expiryDate": "2024-11-15 12:50:41",
"isStickied": true,
"parentCategories": [ "Human Resources" ],
"categories": [
"Remote-Administrative-Assistant",
"Administrative-Assistant",
"Virtual-Assistant",
"Executive-Assistant"
],
"skills": [
"Administrative-Support",
"Project-Management",
"MS-Office-Suite",
"Remote-Collaboration"
],
"guid": "https://himalayas.app/companies/infrasync-technology-services/jobs/remote-administrative-assistant",
"company": {
"name": "Infrasync Technology Services",
"slug": "infrasync-technology-services",
"employeeRange": "1-10",
"isVerified": true,
"logo": "https://cdn-images.himalayas.app/l6y9d2uqx7o85917rznbk97ucczm",
"internalLink": "https://himalayas.app/companies/infrasync-technology-services",
"externalLink": "https://infrasync.com?utm_source=himalayas.app&utm_medium=himalayas.app&utm_campaign=himalayas.app&ref=himalayas.app&source=himalayas.app",
"yearFounded": 2024,
"ceo": "Andrew Swirsky",
"liveJobsCount": 1,
"liveJobSlugs": [ "remote-administrative-assistant" ],
"linkedin": "https://www.linkedin.com/company/98777116"
}
}
]
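Since each item nests the employer under `company`, exporting to CSV or a warehouse usually means flattening first. A minimal sketch, using only field names from the sample record above (the flat column names are an arbitrary choice):

```python
def flatten_job(item: dict) -> dict:
    """Flatten one scraped job item (per the schema above) into a flat row."""
    company = item.get("company") or {}
    return {
        "guid": item.get("guid"),              # stable key for dedupe/upserts
        "slug": item.get("slug"),
        "title": item.get("title"),
        "employmentType": item.get("employmentType"),
        "minSalary": item.get("minSalary"),
        "maxSalary": item.get("maxSalary"),
        "currency": item.get("currency"),
        "skills": ", ".join(item.get("skills") or []),
        "company_name": company.get("name"),
        "company_slug": company.get("slug"),
        "company_size": company.get("employeeRange"),
    }
```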
Himalayas Job & Company Scraper/
├── src/
│ ├── main.py
│ ├── runner.py
│ ├── cli.py
│ ├── config/
│ │ ├── settings.py
│ │ └── logging.yaml
│ ├── core/
│ │ ├── browser.py
│ │ ├── routes.py
│ │ ├── validators.py
│ │ └── retry.py
│ ├── extractors/
│ │ ├── jobs_extractor.py
│ │ ├── companies_extractor.py
│ │ ├── job_detail_parser.py
│ │ ├── company_detail_parser.py
│ │ └── html_to_text.py
│ ├── models/
│ │ ├── job.py
│ │ ├── company.py
│ │ └── common.py
│ ├── normalization/
│ │ ├── salary.py
│ │ ├── tags.py
│ │ ├── dates.py
│ │ └── dedupe.py
│ ├── outputs/
│ │ ├── dataset_writer.py
│ │ ├── jsonl_exporter.py
│ │ └── csv_exporter.py
│ └── utils/
│ ├── urls.py
│ ├── hashing.py
│ └── timers.py
├── tests/
│ ├── test_jobs_parser.py
│ ├── test_company_parser.py
│ └── test_normalization.py
├── data/
│ ├── input.sample.json
│ └── output.sample.json
├── .env.example
├── requirements.txt
├── pyproject.toml
├── README.md
└── LICENSE
- Job board operators use it to collect Himalayas job data at scale, so they can publish searchable listings with consistent fields and fewer duplicates.
- Recruitment agencies use it to extract company profiles and open roles, so they can enrich leads and speed up outreach.
- HR tech teams use it to feed ATS/CRM pipelines, so they can automate sourcing and keep listings fresh with scheduled re-runs.
- Market analysts use it to track salaries, skills, and hiring trends, so they can compare demand across roles, regions, and timezones.
- SaaS enrichment workflows use it to capture websites and social profiles, so they can build better firmographic datasets for sales ops.
Q1) What input should I provide to start extracting data?
Provide a single Himalayas search URL for either job search results or the company directory. Use the website filters (role, category, location, timezone) first, then paste the filtered URL so the extractor mirrors your selection.
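Before pasting a URL, it can help to sanity-check which mode it will trigger. A hedged sketch: the `/jobs` and `/companies` path prefixes are assumptions based on typical himalayas.app URLs, not a guaranteed contract, so adjust to the URLs you actually use.

```python
from urllib.parse import urlparse


def classify_input_url(url: str) -> str:
    """Guess whether a Himalayas URL targets job results or the company library.

    The path prefixes checked here are assumptions drawn from himalayas.app
    URL patterns and may need adjusting for other page types.
    """
    path = urlparse(url).path
    if path.startswith("/companies"):
        return "companies"
    if path.startswith("/jobs"):
        return "jobs"
    raise ValueError(f"Unrecognized Himalayas URL: {url}")
```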
Q2) Why are some jobs missing salary or some companies missing benefits/tech stacks?
Not every listing publishes salary ranges, and some company profiles are incomplete. The extractor returns null/empty values when fields are not available so downstream systems can handle partial enrichment safely.
Q3) How do I avoid duplicates when I run this frequently?
Use stable identifiers like guid, slug, and company.slug as primary keys. Store company IDs/slugs and perform upserts instead of inserts. For job updates, use updatedAt and expiryDate to sync changes and remove expired roles.
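The upsert pattern described above can be sketched with SQLite, keyed on `guid` and only overwriting when `updatedAt` is newer. This is a minimal illustration, not the scraper's own storage layer; the table and column names are arbitrary.

```python
import sqlite3


def upsert_jobs(conn: sqlite3.Connection, items: list[dict]) -> None:
    """Upsert scraped jobs keyed on guid, keeping only the newest updatedAt.

    Timestamps like '2024-10-28 07:30:07' sort correctly as strings, so a
    plain text comparison is enough to detect a newer revision.
    """
    conn.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               guid TEXT PRIMARY KEY,
               title TEXT,
               updated_at TEXT
           )"""
    )
    conn.executemany(
        """INSERT INTO jobs (guid, title, updated_at)
           VALUES (:guid, :title, :updatedAt)
           ON CONFLICT(guid) DO UPDATE SET
               title = excluded.title,
               updated_at = excluded.updated_at
           WHERE excluded.updated_at > jobs.updated_at""",
        items,
    )
    conn.commit()
```

Expired roles can then be pruned in a follow-up `DELETE` using the stored `expiryDate`.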
Q4) Can I extract only companies or only jobs?
Yes. Use a company directory/search URL to focus on companies, or a jobs search URL to focus on jobs. If your workflow needs both, run two inputs and join on company.slug or company.internalLink.
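If you run both inputs, the join on `company.slug` mentioned above can be sketched in memory (field names per the output schema; the `company_profile` key is an arbitrary choice for the merged result):

```python
def join_jobs_with_companies(jobs: list[dict], companies: list[dict]) -> list[dict]:
    """Attach full company profiles to jobs by matching on company.slug."""
    by_slug = {c["slug"]: c for c in companies if c.get("slug")}
    merged = []
    for job in jobs:
        slug = (job.get("company") or {}).get("slug")
        # None when the job's company was not in the companies run
        merged.append({**job, "company_profile": by_slug.get(slug)})
    return merged
```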
Primary Metric: ~35–70 listings/min on typical search pages, depending on filters and the number of detail pages required per item.
Reliability Metric: 96–99% successful item completion on stable runs, with automatic retries handling intermittent navigation and network blips.
Efficiency Metric: ~250–450 MB RAM average during active extraction, with throughput scaling mainly by concurrent page processing and the number of detail pages visited.
Quality Metric: 85–95% field completeness for core fields (slug, title, URLs, timestamps), while optional enrichment fields (salary, benefits, stacks, socials) vary based on profile completeness.
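To verify the field-completeness figure against your own runs, a small check like this works; the choice of core fields mirrors the ones named above, and the empty-value set is an assumption you may want to tune.

```python
CORE_FIELDS = ("slug", "title", "guid", "createdAt", "updatedAt")


def completeness(records: list[dict], fields=CORE_FIELDS) -> float:
    """Share of (record, field) cells that are non-empty, as a 0-100 percent."""
    if not records:
        return 0.0
    filled = sum(
        1
        for r in records
        for f in fields
        if r.get(f) not in (None, "", [], {})  # treat these as missing
    )
    return 100 * filled / (len(records) * len(fields))
```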
