We've been building data pipelines that scrape websites and extract structured data for a while now. If you've done this, you know the drill: you write CSS selectors, the site changes its layout, everything breaks at 2am, and you spend your morning rewriting parsers.
LLMs seemed like the obvious fix — just throw the HTML at GPT and ask for JSON. Except in practice, it's more painful than that:
- Raw HTML is full of nav bars, footers, and tracking junk that eats your token budget. A typical product page is 80% noise.
- LLMs return malformed JSON more often than you'd expect, especially with nested arrays and complex schemas. One bad bracket and your pipeline crashes.
- Relative URLs, markdown-escaped links, tracking parameters: the "small" URL issues compound fast when you're processing thousands of pages.
- You end up writing the same boilerplate: HTML cleanup → markdown conversion → LLM call → JSON parsing → error recovery → schema validation. Over and over.
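To make the URL point concrete, here is a minimal sketch of the kind of cleanup involved: resolving a relative link against the page URL and dropping common tracking parameters. It is an illustration only, not code from the library; `cleanUrl` and the `TRACKING_PARAMS` list are hypothetical names, and the parameter list is an assumption you would tune per site.

```typescript
// Hypothetical helper for illustration; not part of any library's API.
// Assumed list of tracking parameters; adjust for the sites you scrape.
const TRACKING_PARAMS = new Set<string>(["gclid", "fbclid", "ref"]);

function cleanUrl(href: string, pageUrl: string): string | null {
  let url: URL;
  try {
    // Resolves relative links ("/p/42") against the page they were found on.
    url = new URL(href, pageUrl);
  } catch {
    // Unparseable link or base: let the caller decide what to do with it.
    return null;
  }

  // Collect keys first so we do not mutate the params while iterating them.
  const keys: string[] = [];
  url.searchParams.forEach((_value, key) => keys.push(key));

  for (const key of keys) {
    if (key.startsWith("utm_") || TRACKING_PARAMS.has(key)) {
      url.searchParams.delete(key);
    }
  }
  return url.toString();
}
```

Returning `null` for unparseable links, rather than throwing, keeps one bad `href` from failing a whole page.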
We got tired of rebuilding this stack for every project, so we extracted it into a library.
Lightfeed Extractor is a TypeScript library that handles the full pipeline from raw HTML to validated, structured data:
- Converts HTML to LLM-ready markdown with main content extraction (strips nav, headers, footers), optional image inclusion, and URL cleaning
- Works with any LangChain-compatible LLM (OpenAI, Gemini, Claude, Ollama, etc.)
- Uses Zod schemas for type-safe extraction with real validation
- Recovers partial data from malformed LLM output instead of failing entirely: if 19 out of 20 products parsed correctly, you get those 19
- Built-in browser automation via Playwright (local, serverless, or remote) with anti-bot patches
- Pairs with our browser agent (@lightfeed/browser-agent) for AI-driven page navigation before extraction
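The partial-recovery idea can be sketched roughly as follows: validate each array item on its own and keep the ones that pass, instead of rejecting the whole response. This is not the library's implementation. It uses a hand-written type guard in place of a Zod schema to stay dependency-free, it only covers items that fail validation (repairing syntactically broken JSON is out of scope here), and `Product`, `isProduct`, and `recoverValidItems` are hypothetical names.

```typescript
// Hypothetical shape for illustration.
interface Product {
  name: string;
  price: number;
}

// Hand-written guard standing in for a schema check.
function isProduct(value: unknown): value is Product {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.name === "string" &&
    typeof v.price === "number" &&
    Number.isFinite(v.price)
  );
}

// Keeps every item that validates and counts the ones that do not.
function recoverValidItems<T>(
  rawJson: string,
  isValid: (value: unknown) => value is T
): { items: T[]; rejected: number } {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawJson);
  } catch {
    // Syntax-level repair is not attempted in this sketch.
    return { items: [], rejected: 0 };
  }
  if (!Array.isArray(parsed)) return { items: [], rejected: 0 };

  const items: T[] = [];
  let rejected = 0;
  for (const entry of parsed) {
    if (isValid(entry)) {
      items.push(entry);
    } else {
      rejected++;
    }
  }
  return { items, rejected };
}
```

Reporting the `rejected` count alongside the recovered items lets a pipeline log or alert on degraded output without discarding the good rows.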
We use this ourselves in production at Lightfeed, and it's been solid enough that we decided to open-source it.
GitHub: https://github.com/lightfeed/extractor
npm: npm install @lightfeed/extractor
Apache 2.0 licensed.
Happy to answer questions or hear feedback.
Comments URL: https://news.ycombinator.com/item?id=47526486
Points: 24
# Comments: 12