
Lead Generation with Web Scraping: Data, Sources & CRM Workflow (2026)

· 6 min read
Yassine El Haddad
Software Developer & Automation Specialist

I build production AI agents, web scrapers, and automation pipelines. Most of what I publish here comes from the actual problems they run into: proxies that get banned, anti-bot stacks that fingerprint your client, RAG that drifts when the underlying data moves. Stack: Python, TypeScript, Go, FastAPI, LangChain, Crawlee, Playwright, deployed on AWS, GCP, and Cloudflare.

Purchased lists go stale fast, overlap with competitors, and bounce. Scraping public business and professional signals—then enriching, validating, and syncing—gives you a pipeline you control.

Quick Answer

Web scraping automates B2B lead collection from Google Maps, LinkedIn, company directories, and websites. Tools like Apify turn hours of manual prospecting into minutes of configured runs plus quality checks.

This article covers what data to collect, where to get it, an automation workflow, enrichment, and CRM integration. For ready-made tools, start with lead generation in the Apify Store.

What data to collect (and why)

| Field | Why it matters | Typical source |
| --- | --- | --- |
| Company / place name | Messaging, dedupe | Maps, directories |
| Website / domain | Key for email guessing & tech lookup | Maps, footer, LinkedIn |
| Location & category | ICP fit | Maps, industry sites |
| Phone | Call workflows, dedupe | Maps, site |
| Role / title | Personalization | LinkedIn, listings |
| Person name | Sequences | LinkedIn, bylines |
| Source URL | Audit trail | Every scrape |

Rule: one stable business key (domain + region, or Maps placeId where available) before you enrich people.
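A minimal sketch of that rule, assuming each scraped row exposes a Maps place ID, a domain, and a region. The function name, argument names, and key format are illustrative, not any Actor's output schema.

```python
from typing import Optional


def business_key(place_id: Optional[str], domain: Optional[str], region: Optional[str]) -> Optional[str]:
    """Return one stable key per business, or None if no usable key exists.

    Prefers the Maps place ID; falls back to domain + region.
    All field names here are assumptions made for illustration.
    """
    if place_id:
        return f"place:{place_id.strip()}"
    if domain and region:
        host = domain.strip().lower()
        if host.startswith("www."):
            host = host[4:]  # treat www.example.com and example.com as one business
        return f"domain:{host}|{region.strip().lower()}"
    return None  # not enough information to enrich people against this row
```

Rows that return None are worth holding back from people enrichment until a domain or place ID turns up.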

Best sources for B2B leads

  1. Google Maps: local density (restaurants, clinics, agencies, contractors) with structured names, phones, sites, and categories.
  2. LinkedIn: titles and companies for decision-makers (stay within public data and each tool's terms).
  3. Company directories & niche listings: industry portals, review sites, registries (search the Apify Store for the site's domain to find an existing Actor).
  4. Company websites: contact pages, team pages, generic inboxes, social links.

Browse lead-generation Actors →

For hard targets (strict anti-bot defenses, heavy residential proxy needs), teams sometimes pair Apify runs with dedicated proxy vendors such as Bright Data; most SMB lead gen works with the default Apify platform options.

Automation workflow (step by step)

Use this as a template; swap Actors to match your sources.

| Step | Action | Output |
| --- | --- | --- |
| 1 | Define ICP: geo, category, company size, title keywords | Written filter rules |
| 2 | Discover: run a Google Maps or directory Actor with tight queries | Table of businesses + websites |
| 3 | Extract contacts from sites: feed website URLs to a contact / email Actor | Emails, phones, socials |
| 4 | Add humans: optional LinkedIn Actor pass on company or role queries | Names, titles, profile URLs |
| 5 | Normalize: one row per business; standard domains & phones | Clean CSV/JSON |
| 6 | Enrich: email finders (e.g. Hunter, Apollo) from name + domain | Guessed emails + scores |
| 7 | Validate: SMTP or verification vendor | “Valid” only for outreach |
| 8 | Dedupe: domain + market or place ID | CRM-safe file |
| 9 | Sync: webhook → Make, Zapier, or n8n → HubSpot, Pipedrive, Salesforce | Live CRM |
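Step 5 is the one most often skipped, so here is a rough sketch of what "standard domains & phones" could mean in practice. The helper names are hypothetical, and the phone rule (digits only, keeping a leading +) is a simplifying assumption rather than full E.164 handling.

```python
import re
from urllib.parse import urlparse


def normalize_domain(website: str) -> str:
    """Reduce a scraped website value to a bare lowercase host without 'www.'."""
    raw = website.strip().lower()
    if not raw:
        return ""
    # urlparse only fills in the host when a scheme is present, so add one if missing.
    if "://" not in raw:
        raw = "http://" + raw
    host = urlparse(raw).hostname or ""
    return host[4:] if host.startswith("www.") else host


def normalize_phone(phone: str) -> str:
    """Keep digits only, preserving a leading '+' when the input had one."""
    digits = re.sub(r"\D", "", phone)
    if not digits:
        return ""
    return ("+" if phone.strip().startswith("+") else "") + digits
```

For example, `normalize_domain("https://www.Example.com/contact")` gives `example.com`, which is the form the dedupe step in row 8 can compare reliably.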

Orchestration sketch:

Maps / directory Actor → dataset → contact Actor → (optional) LinkedIn Actor → enrich → validate → CRM

On Apify, each step is a run; use webhooks or scheduled exports to avoid manual downloads.

Example chain (local B2B)

  1. Input: the query "commercial HVAC contractors", a city, and a max-results cap (start small, e.g. 25).
  2. Maps run: collect name, phone, website, category, rating.
  3. Website pass: Contact Details Scraper or Email & Phone Extractor on each website.
  4. Filter: drop rows with no domain; flag info@ / contact@ vs person-like patterns if your team cares.
  5. Validate before any cold email.
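The filter in step 4 of this chain can be sketched as below. The row keys (`domain`, `email`) and the set of role-style local parts beyond info@ and contact@ are assumptions; adjust both to your Actor's output and your team's playbook.

```python
from typing import Any, Dict, List

# info@ and contact@ come from the step above; the other entries are illustrative guesses.
ROLE_LOCAL_PARTS = {"info", "contact", "hello", "sales", "support", "office", "admin"}


def is_role_inbox(email: str) -> bool:
    """True when the part before '@' looks like a shared inbox rather than a person."""
    local = email.split("@", 1)[0].strip().lower()
    return local in ROLE_LOCAL_PARTS


def filter_and_flag(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Drop rows with no domain and add a 'role_inbox' flag to the rest."""
    kept = []
    for row in rows:
        if not row.get("domain"):
            continue  # no domain means nothing to enrich or dedupe on
        email = row.get("email") or ""
        flagged = dict(row)  # copy so the caller's rows stay untouched
        flagged["role_inbox"] = bool(email) and is_role_inbox(email)
        kept.append(flagged)
    return kept
```

Flagging rather than deleting role inboxes keeps the decision with whoever owns the outreach playbook.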

For Maps-specific setup, see Scrape Google Maps. For LinkedIn boundaries, see Scrape LinkedIn.

Enrichment: what “done” looks like

  • Minimum viable lead: company + domain + validated channel (email or phone you’re allowed to use).
  • Sales-ready lead: + role/title + LinkedIn or proof of seniority + ICP tags.
  • Never skip email validation: high bounce rates hurt domain reputation (many teams aim for a bounce rate under 5%).
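One way to make these definitions checkable is a small tiering function. This is a sketch under assumptions: the keys (`email_status`, `phone_allowed`, `seniority_proof`, `icp_tags`) and the tier labels are hypothetical names, not fields from any particular tool.

```python
from typing import Any, Dict


def lead_tier(lead: Dict[str, Any]) -> str:
    """Classify a lead as 'incomplete', 'minimum_viable', or 'sales_ready'."""
    has_base = bool(lead.get("company") and lead.get("domain"))
    # A channel counts only if the email was validated or the phone is cleared for use.
    has_channel = lead.get("email_status") == "valid" or bool(lead.get("phone_allowed"))
    if not (has_base and has_channel):
        return "incomplete"

    has_seniority = bool(lead.get("linkedin_url") or lead.get("seniority_proof"))
    if lead.get("title") and has_seniority and lead.get("icp_tags"):
        return "sales_ready"
    return "minimum_viable"
```

Running this before CRM sync gives you a simple gate: only `minimum_viable` and `sales_ready` rows move forward.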

CRM integration

  • No-code: Apify webhook on successful run → Zapier / Make → create or update Company + Contact.
  • Automation-heavy: n8n + Apify for branching (e.g. only if email valid).
  • Custom: pull results from the Apify dataset API in your own worker and upsert them through your CRM's API.

Map fields explicitly: domain → company match, email → contact key, source URL → custom property for compliance review.
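As an illustration of explicit mapping, the sketch below turns one cleaned row into a payload with the three roles named above. The payload shape and the `lead_source_url` property name are invented for this example; real CRMs (HubSpot, Pipedrive, Salesforce) each define their own object and property names.

```python
from typing import Any, Dict


def to_crm_payload(row: Dict[str, Any]) -> Dict[str, Any]:
    """Map a cleaned lead row to a hypothetical CRM upsert payload."""
    domain = (row.get("domain") or "").strip().lower()
    email = (row.get("email") or "").strip().lower()
    if not domain or not email:
        raise ValueError("row needs both a domain and an email before sync")
    return {
        # domain is what the company record is matched on
        "company": {"match_on": "domain", "domain": domain, "name": row.get("name", "")},
        # email is the unique key for the contact record
        "contact": {"key": "email", "email": email},
        # source URL is kept as a custom property for compliance review
        "custom_properties": {"lead_source_url": row.get("source_url", "")},
    }
```

Keeping the mapping in one function makes it easy to review which scraped fields actually reach the CRM.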


Lead quality vs volume

One hundred validated, ICP-tight leads beat ten thousand generic rows. Filter early on category, geography, rating, and title keywords—not only after you pay for enrichment.
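A sketch of filtering early, assuming rows carry `category`, `city`, `rating`, and (once the LinkedIn pass has run) `title`. The `IcpRules` class and its fields are hypothetical; the point is that the check runs before any paid enrichment call.

```python
from dataclasses import dataclass
from typing import Any, Dict, Set, Tuple


@dataclass
class IcpRules:
    categories: Set[str]                   # lowercase category names to accept
    cities: Set[str]                       # lowercase city names to accept
    min_rating: float = 0.0                # minimum Maps rating
    title_keywords: Tuple[str, ...] = ()   # lowercase substrings to look for in titles


def matches_icp(row: Dict[str, Any], rules: IcpRules) -> bool:
    """Return True when a row passes every ICP rule that applies to it."""
    if (row.get("category") or "").lower() not in rules.categories:
        return False
    if (row.get("city") or "").lower() not in rules.cities:
        return False
    if float(row.get("rating") or 0) < rules.min_rating:
        return False
    # Business rows from Maps have no title yet, so only check titles that exist.
    title = (row.get("title") or "").lower()
    if title and rules.title_keywords:
        return any(keyword in title for keyword in rules.title_keywords)
    return True
```

Applied right after the discovery run, this keeps off-target rows out of the steps you pay for per record.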

Start on the Store

Use category search to find maintained Actors before writing custom code.



Lead generation — Apify Store → · Sign up on Apify →

Laws and platform terms vary by country and channel. Use public data, respect robots/terms where they apply, document purpose, and give recipients clear opt-out for email. This is not legal advice—see Is web scraping legal? and counsel for your markets.

Frequently Asked Questions

Which sources work best for B2B lead scraping?

Google Maps for local businesses (name, phone, website, category), LinkedIn for public professional signals (titles, companies), niche directories for industry lists, and company websites for published contacts. Chain sources so each step adds a field you actually use.

How does Apify help automate lead generation?

Apify runs pre-built Actors on a schedule, stores results in datasets, and connects to Zapier, Make, n8n, and APIs, so discovery, extraction, and handoff to CRM require far less manual copy-paste than browser-only workflows.

Can I get email addresses from LinkedIn?

LinkedIn typically does not expose verified personal emails in public views. Common pattern: scrape public name + company, derive the corporate domain, then use an email finder and validator. Always comply with LinkedIn’s terms and applicable privacy laws.

Why validate emails before outreach?

Mailbox providers track bounces. Sustained high bounce rates hurt deliverability for your whole domain. Validation (NeverBounce, ZeroBounce, or similar) reduces risk before CRM import or sequences.

How do I avoid duplicates when importing into a CRM?

Normalize domains, standardize phone formats, dedupe on domain plus market or a stable place ID from Maps, then use CRM upsert APIs or matching rules in your automation tool.

Is scraping leads legal?

It depends on jurisdiction, data type (B2B vs personal), and how you use and market to contacts. GDPR, CAN-SPAM, CASL, and platform terms all matter. Consult qualified counsel for your use case; our legality guide is an overview only.

Common mistakes and fixes

Runs return businesses but almost no emails.

Expect many SMB sites to hide personal emails; chain a contact-details Actor on the website column and allow role-based inboxes only if your playbook accepts them.

LinkedIn data is incomplete or runs fail.

Stay within public data and each Actor’s limits; reduce concurrency and scope; read the run log for rate or session errors.

CRM is full of duplicates after import.

Normalize domains (strip www), standardize phones, dedupe on domain + geo or company ID before sync.
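A minimal sketch of that fix, assuming each row has `domain` and `city` keys (both names are illustrative). It keeps the first row seen for each domain + city pair and skips rows that have no domain at all.

```python
from typing import Dict, Iterable, List, Set, Tuple


def dedupe_rows(rows: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first row per (normalized domain, city); skip rows without a domain."""
    seen: Set[Tuple[str, str]] = set()
    unique: List[Dict[str, str]] = []
    for row in rows:
        domain = (row.get("domain") or "").strip().lower()
        if domain.startswith("www."):
            domain = domain[4:]  # strip www so both spellings collapse to one key
        if not domain:
            continue  # nothing reliable to match on
        key = (domain, (row.get("city") or "").strip().lower())
        if key in seen:
            continue
        seen.add(key)
        unique.append(row)
    return unique
```

Swap `city` for a company ID or Maps place ID if your rows carry one; the pattern stays the same.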