
LinkedIn News Scraping with Persistent Authentication
Building LinkedIn News Scrapers with Witrium: Authentication & Session Management
Purpose: Learn how to build authenticated web automations using Witrium's session management features to extract LinkedIn news headlines while maintaining logged-in state across multiple runs.
Legal note: Always review and comply with LinkedIn's Terms of Service and applicable laws. Use respectful rates, avoid abuse, and consider LinkedIn's API when available for commercial use cases.
What We're Building
LinkedIn's news section contains valuable industry insights, but accessing it requires authentication. Traditional scraping approaches struggle with login flows, session management, and maintaining authenticated state across multiple runs.
This tutorial demonstrates Witrium's session management capabilities. Instead of logging in every time you want to extract news data, we'll create a persistent login session that can be reused across multiple extraction runs. This approach is more efficient, puts less load on the platform, and mirrors real user behavior.
The Strategy:
- Login Workflow: Handle authentication once and save the browser state
- Extraction Workflow: Reuse the saved session to extract news data on demand
- Reusable Sessions: Maintain authentication across multiple runs without re-logging in
Why Session Management Matters
Traditional Challenges with Authenticated Scraping
Most web scraping approaches fail when authentication is required:
Manual Login Required:
- Need to manually log in before each scraping session
- Difficult to automate login flows reliably
- Authentication state gets lost between runs
Session Complexity:
- Managing cookies, tokens, and browser state manually
- Handling multi-factor authentication and CAPTCHA challenges
- Dealing with session timeouts and re-authentication
Infrastructure Overhead:
- Maintaining persistent browser instances
- Managing authentication credentials securely
- Coordinating login state across multiple scraping tasks
The Witrium Session Management Advantage
Witrium solves these challenges with built-in session management:
✅ Persistent Sessions - Save and reuse authenticated browser states
✅ Secure Credential Handling - Encrypted storage for passwords and sensitive data
✅ Session Isolation - Named sessions for different accounts or use cases
✅ Automatic State Management - Browser state preserved across workflow runs
✅ Visual Session Building - Test authentication flows interactively
Prerequisites
- A Witrium account with an API token from the Witrium dashboard
- A LinkedIn account for testing
- Basic understanding of Witrium workflows (see our Amazon scraping tutorial for fundamentals)
Implementation Overview
We'll build this in two phases:
- Authentication Workflow - Handle login and save the session
- Extraction Workflow - Use the saved session to extract news data
This separation allows you to authenticate once and run extractions multiple times without re-logging in.
Phase 1: Building the Login Workflow
Step 1: Create the Authentication Workflow
- Navigate to https://witrium.com/workflows and click Create Workflow
- Name it "LinkedIn Login" and click Create Workflow
Step 2: Set the Target URL for Direct Login
Set the Target URL to LinkedIn's direct login page:
https://www.linkedin.com/login
Why this matters: By targeting the login page directly instead of the homepage, we skip unnecessary redirects and get straight to the authentication form. This makes the workflow more reliable and faster.
Step 3: Start Build Session
Click Start Build Session. Witrium opens a live browser instance showing LinkedIn's login page. You'll see the login form ready for interaction.
Step 4: Add Email Input Instruction
Add the following instruction:
In the email field enter {{email}}
Understanding Arguments: The {{email}} syntax creates a workflow argument. You can name it anything you want (see the Working with Instructions section in the documentation for more details). During execution, you can pass different values to this argument without modifying the workflow, which makes the workflow reusable across different accounts.
After adding this instruction, you'll see a new "email" field appear on the instruction panel. Enter your LinkedIn email address and click Play to test the instruction.
Step 5: Add Password Input Instruction (Secure)
Add the password instruction:
In the password field enter {{$password}}
Secret Arguments: The $ prefix before password marks this as a secret value (see the Working with Instructions section in the documentation for more details). Witrium:
- Stores secret values in an encrypted vault
- Automatically deletes them after workflow completion
- Never sends secret values to the underlying AI model
- Displays them as password fields (masked input)
Enter your LinkedIn password in the "$password" field and click Play.
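To make the placeholder convention concrete, here is a minimal Python sketch that pulls argument names out of instruction text and separates plain arguments from secret ones. It only illustrates the {{name}} and {{$name}} syntax described above; it is not Witrium's actual parser, and find_arguments is a hypothetical helper name.

```python
import re

# Matches {{name}} and {{$name}}; the optional "$" marks a secret argument.
PLACEHOLDER = re.compile(r"\{\{\s*(\$?)(\w+)\s*\}\}")

def find_arguments(instructions):
    """Return (plain, secret) argument names found in the instruction texts."""
    plain, secret = [], []
    for text in instructions:
        for dollar, name in PLACEHOLDER.findall(text):
            target = secret if dollar else plain
            if name not in target:
                target.append(name)
    return plain, secret

steps = [
    "In the email field enter {{email}}",
    "In the password field enter {{$password}}",
    "Click on sign in button",
]
plain_args, secret_args = find_arguments(steps)
print("plain:", plain_args)
print("secret:", secret_args)
```

A check like this can be handy when reviewing a workflow's instructions to confirm that every sensitive value carries the $ prefix.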
Step 6: Submit the Login Form
Add the sign-in instruction:
Click on sign in button
Click Play to execute the login. You should see LinkedIn's authentication process begin.
Step 7: Add Session Stabilization Wait
Add a crucial wait instruction:
Wait 10 seconds
Why This Is Critical: This wait serves multiple purposes:
- Allows LinkedIn's post-login redirects to complete
- Ensures the homepage loads fully with all authentication tokens
- Provides buffer time for any additional verification steps
- Guarantees the session is in a stable state before saving
Session Management Requirement: Witrium's session saving works best when the page is fully loaded. This wait ensures reliable session preservation.
Step 8: Complete the Login Workflow
Click End Build Session to finalize the login workflow. Your authentication workflow is now ready.
Step 9: Execute and Save the Session
Now we'll run the workflow and create a persistent session:
- Click Run Workflow
- In the popup, enter your email and $password credentials
- Click the Session Management tab
- Toggle the "Preserve session" switch to ON
- Enter a unique Session Name (e.g., "Linkedin login-session")
- Click Start Run
What Happens During Execution:
- Witrium runs through all your login steps
- Completes the authentication process
- Captures the entire browser state (cookies, tokens, storage)
- Securely saves the session under your chosen name
- Makes the session available for future workflows
Managing Sessions: All your saved sessions can be managed at https://witrium.com/settings?tab=browser-sessions
Phase 2: Building the Extraction Workflow
Step 1: Create the News Extraction Workflow
- Create a new workflow named "LinkedIn News Extraction"
- For the Target URL, use LinkedIn's post-login homepage: https://www.linkedin.com/feed/
Step 2: Add the Extraction Instructions
Add these instructions in order:
Instruction 1: Page Load Stabilization
Wait 10 seconds
Purpose: Ensures the LinkedIn homepage loads completely with all dynamic content, including the news section.
Instruction 2: Expand News Content
In the LinkedIn news section, click on "show more"
Purpose: LinkedIn initially shows only a few headlines. This instruction expands the view to display all available top stories for more comprehensive extraction.
Instruction 3: Data Extraction
Extract the following data for all visible top stories from the LinkedIn news section and return as JSON:
1) `headline`: the visible headline (string)
2) `time`: the time duration for that headline (string)
3) `readers`: the number of readers listed for that headline (integer)
Return a top-level JSON object: { "news": [ ... ] }
Extraction Details:
- Structured Output: Enforces consistent JSON format across runs
- Complete Coverage: Extracts all visible stories, not just featured ones
- Rich Metadata: Captures engagement metrics (readers) and recency (time)
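Because the extraction is driven by a natural-language instruction, it can be worth checking each run's output against the requested schema before using it. The sketch below assumes the result arrives as a Python dict shaped like { "news": [ ... ] }. The NewsItem class and parse_news function are illustrative names, not part of Witrium.

```python
from dataclasses import dataclass

@dataclass
class NewsItem:
    headline: str
    time: str
    readers: int

def parse_news(payload):
    """Validate a {"news": [...]} payload and return a list of NewsItem."""
    items = []
    for raw in payload.get("news", []):
        headline = raw.get("headline")
        if not isinstance(headline, str) or not headline.strip():
            continue  # skip entries without a usable headline
        readers = raw.get("readers")
        if isinstance(readers, str):
            # Tolerate values such as "15,420" if a string comes back
            digits = readers.replace(",", "").strip()
            readers = int(digits) if digits.isdigit() else 0
        elif not isinstance(readers, int):
            readers = 0
        items.append(NewsItem(headline.strip(), str(raw.get("time", "")), readers))
    return items

sample = {"news": [
    {"headline": "Example headline", "time": "2h", "readers": "15,420"},
    {"headline": "", "time": "1h", "readers": 10},
]}
parsed = parse_news(sample)
print(parsed)
```

Falling back to 0 for an unreadable reader count is a design choice made for this sketch. Raising an error instead may be preferable if downstream analytics depend on that field.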
Step 3: Test the Extraction Workflow
To test the workflow with your saved session:
- Click Run Workflow
- Navigate to the Session Management tab
- Select "Use existing session"
- Choose your saved LinkedIn session from the dropdown
- Click Start Run
What Happens:
- Witrium loads your saved browser state
- Bypasses the login process entirely
- Starts directly from your authenticated LinkedIn homepage
- Executes the extraction instructions
- Returns structured JSON data
Step 4: Review Extraction Results
After the workflow completes, you'll see the extracted news data in JSON format:
{
  "news": [
    {
      "headline": "Tech layoffs continue as startups face funding challenges",
      "time": "2h",
      "readers": 15420
    },
    {
      "headline": "Remote work policies evolving in 2025",
      "time": "4h",
      "readers": 8760
    }
    // ... more news items
  ]
}
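Once the JSON is available, a few lines of Python are enough to rank stories and normalize the recency labels. This is a rough sketch that assumes the time field uses a number followed by m, h, or d (as in "2h" above). Other label formats may appear, in which case age_in_minutes returns None. Both function names are illustrative.

```python
import re

UNIT_MINUTES = {"m": 1, "h": 60, "d": 1440}

def age_in_minutes(label):
    """Convert a label such as "2h" to minutes; return None if unrecognized."""
    match = re.fullmatch(r"(\d+)\s*([mhd])", label.strip().lower())
    if not match:
        return None
    return int(match.group(1)) * UNIT_MINUTES[match.group(2)]

def top_stories(news, limit=5):
    """Return the most-read stories first."""
    return sorted(news, key=lambda item: item["readers"], reverse=True)[:limit]

news = [
    {"headline": "Remote work policies evolving in 2025", "time": "4h", "readers": 8760},
    {"headline": "Tech layoffs continue as startups face funding challenges", "time": "2h", "readers": 15420},
]
for story in top_stories(news):
    print(story["readers"], age_in_minutes(story["time"]), story["headline"])
```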
Automation & Integration
Option A: REST API Integration
Witrium generates a REST endpoint for each workflow. You can trigger the extraction workflow programmatically:
curl -X POST \
  -H "Authorization: Bearer <YOUR_API_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{"use_states": ["Linkedin login-session"]}' \
  "https://api.witrium.com/v1/workflows/<EXTRACTION_WORKFLOW_ID>/run"
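If you'd rather not shell out to curl, the same request can be assembled with Python's standard library. The sketch below mirrors the curl call above (same URL, headers, and use_states body). The response format is not documented in this tutorial, so trigger_run simply assumes a JSON body comes back. Both function names are made up for illustration.

```python
import json
import urllib.request

API_BASE = "https://api.witrium.com/v1"  # taken from the curl example

def build_run_request(api_token, workflow_id, session_names):
    """Build the POST request that triggers a workflow run."""
    body = json.dumps({"use_states": session_names}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/workflows/{workflow_id}/run",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
    )

def trigger_run(api_token, workflow_id, session_names, timeout=30):
    """Send the request; assumes (unverified) that the response body is JSON."""
    request = build_run_request(api_token, workflow_id, session_names)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

# Build (but do not send) a request with placeholder values to inspect it.
request = build_run_request("demo-token", "demo-workflow-id", ["Linkedin login-session"])
print(request.get_method(), request.full_url)
```

Separating request construction from sending makes the payload easy to inspect or unit-test without touching the network.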
Option B: Python SDK Integration
For more sophisticated integrations:
from witrium.client import SyncWitriumClient

API_TOKEN = "<YOUR_API_TOKEN>"
EXTRACTION_WORKFLOW_ID = "<YOUR_EXTRACTION_WORKFLOW_ID>"

def get_linkedin_news():
    with SyncWitriumClient(api_token=API_TOKEN) as client:
        result = client.run_workflow_and_wait(
            workflow_id=EXTRACTION_WORKFLOW_ID,
            use_states=["Linkedin login-session"]
        )
        if result.status == "COMPLETED":
            news_data = result.result.get("news", [])
            return news_data
        else:
            raise Exception(f"Extraction failed: {result.status}")

# Usage
news = get_linkedin_news()
for item in news:
    print(f"• {item['headline']} ({item['readers']} readers)")
Scheduled Extractions
You can set up automated news collection using:
- Cron jobs for regular intervals
- GitHub Actions for CI/CD integration
- Cloud functions for serverless execution
- Zapier/Make.com for no-code automation
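Whichever scheduler you choose, the job itself usually needs to store results and avoid recording the same headline twice. Here is a minimal sketch that appends unseen headlines to a JSON Lines file. The file name and the append_new_headlines helper are assumptions for illustration. In a real job you would pass it the output of get_linkedin_news() rather than the stand-in data used here.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def append_new_headlines(news, store_path):
    """Append unseen headlines to a JSON Lines file; return how many were added."""
    path = Path(store_path)
    seen = set()
    if path.exists():
        with path.open(encoding="utf-8") as handle:
            seen = {json.loads(line)["headline"] for line in handle if line.strip()}
    collected_at = datetime.now(timezone.utc).isoformat()
    added = 0
    with path.open("a", encoding="utf-8") as handle:
        for item in news:
            if item["headline"] in seen:
                continue
            handle.write(json.dumps({**item, "collected_at": collected_at}) + "\n")
            seen.add(item["headline"])
            added += 1
    return added

# Demo with stand-in data written to a temporary directory.
demo_store = Path(tempfile.mkdtemp()) / "linkedin_news.jsonl"
batch = [{"headline": "Example story", "time": "2h", "readers": 100}]
first_run = append_new_headlines(batch, demo_store)
second_run = append_new_headlines(batch, demo_store)
print(first_run, second_run)
```

Deduplicating on the headline text is a simplification. If headlines get reworded during the day, you may want a looser matching rule.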
Advanced Session Management
Multiple Account Support
Create separate login workflows for different LinkedIn accounts:
# Different sessions for different accounts
personal_session = "linkedin-personal"
company_session = "linkedin-company"
industry_session = "linkedin-industry-news"

# Use appropriate session based on context.
# run_extraction_with_session is not defined in this tutorial; it is assumed
# to wrap run_workflow_and_wait and pass the session name via use_states.
def get_news_by_account(account_type="personal"):
    session_map = {
        "personal": personal_session,
        "company": company_session,
        "industry": industry_session
    }
    return run_extraction_with_session(session_map[account_type])
Session Refresh Strategy
LinkedIn sessions eventually expire. Implement a refresh strategy:
# LOGIN_WORKFLOW_ID, LINKEDIN_EMAIL and LINKEDIN_PASSWORD are assumed to be
# defined elsewhere (for example, loaded from environment variables).
def get_news_with_refresh():
    try:
        return get_linkedin_news()
    except Exception as e:
        if "authentication" in str(e).lower():
            print("Session expired, refreshing...")
            refresh_linkedin_session()
            return get_linkedin_news()
        raise  # re-raise unrelated errors with the original traceback

def refresh_linkedin_session():
    # Run the login workflow to refresh the session
    with SyncWitriumClient(api_token=API_TOKEN) as client:
        result = client.run_workflow_and_wait(
            workflow_id=LOGIN_WORKFLOW_ID,
            args={"email": LINKEDIN_EMAIL, "$password": LINKEDIN_PASSWORD},
            preserve_session=True,
            use_states=["Linkedin login-session"]
        )
Best Practices & Security
Credential Security
- Always use secret arguments ($-prefixed arguments) for sensitive data (see the Working with Instructions section in the documentation for more details)
- Store your Witrium API tokens securely in environment variables
- Use dedicated accounts for automation when possible
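As one way to keep the token out of source code, the sketch below reads it from an environment variable and fails early if it is missing. WITRIUM_API_TOKEN is an assumed variable name, not one Witrium mandates, and the setdefault line exists only so the example runs as-is.

```python
import os

def require_env(name):
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

# For demonstration only: supply a dummy value if the variable is absent.
os.environ.setdefault("WITRIUM_API_TOKEN", "demo-token-for-illustration")

API_TOKEN = require_env("WITRIUM_API_TOKEN")
print("Token loaded:", bool(API_TOKEN))
```

In a real deployment, drop the setdefault line and set the variable through your shell, CI secrets, or cloud function configuration.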
Respectful Usage
- Add appropriate delays between requests
- Don't overwhelm LinkedIn's servers
- Respect rate limits and terms of service
- Consider using LinkedIn's official API for commercial applications
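One simple way to enforce spacing between runs in a long-lived script is a minimum-interval throttle. The class below is a generic sketch, not a Witrium feature. The interval is kept tiny so the example finishes quickly, whereas real extraction jobs would more plausibly be spaced minutes or hours apart.

```python
import time

class MinIntervalThrottle:
    """Blocks so that consecutive calls are at least `min_interval` seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Sleep if needed, then record the call time. Returns seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_call is not None:
            delay = max(0.0, self.min_interval - (now - self._last_call))
        if delay > 0:
            self._sleep(delay)
        self._last_call = self._clock()
        return delay

throttle = MinIntervalThrottle(min_interval=0.2)
first = throttle.wait()   # first call never sleeps
second = throttle.wait()  # second call sleeps for the remaining interval
print(first, round(second, 1))
```

Call throttle.wait() immediately before each extraction run. The injectable clock and sleep parameters are there to make the class easy to test.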
Browser Session Hygiene
- Use descriptive session names for easy management
- Regularly refresh expired sessions
- Manage your stored sessions in the Witrium dashboard (https://witrium.com/settings?tab=browser-sessions), and delete unused sessions to keep the list organized
Error Handling
import time

def robust_news_extraction():
    max_retries = 3
    for attempt in range(max_retries):
        try:
            return get_linkedin_news()
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt == max_retries - 1:
                raise  # out of retries: propagate the last error
            time.sleep(2 ** attempt)  # Exponential backoff: 1s, then 2s
Troubleshooting Common Issues
Session Not Working
- Check session expiry: LinkedIn sessions expire after inactivity
- Verify login success: Ensure the login workflow completed successfully
- Review wait times: Insufficient wait times can cause incomplete session saves
Extraction Failures
- LinkedIn layout changes: The news section layout may have changed; confirm that it is still visible on the homepage
- Content not loaded: Add or increase wait times for dynamic content
- Access restrictions: Verify your account has access to the news section
Rate Limiting
- Add delays: Include wait instructions between actions
- Reduce frequency: Don't run extraction workflows too frequently
- Monitor usage: Watch for LinkedIn's rate limiting responses
Conclusion
You now have a robust LinkedIn news extraction system with persistent authentication. The two-workflow approach provides:
- Secure credential handling with Witrium's secret management
- Persistent sessions that eliminate repeated logins
- Scalable extraction that can be automated and integrated
- Flexible architecture that supports multiple accounts and use cases
This pattern can be extended to other authenticated platforms like Twitter, Facebook, or internal company portals that require login.
Next Steps:
- Explore Witrium's workflow documentation for advanced features
- Set up automated scheduling for regular news collection
- Integrate with your existing data pipelines and analytics tools
Got questions about session management or authentication workflows? Reach out to our support team at support@witrium.com.