TechBrief: The Most Up-to-Date Technology News

TechBrief: The Latest Technology News

A daily reference for news summaries and short analyses from reputable sources.

Latest News

Therabody Promo Codes: 15% Off March 2026

Save on the science-backed devices you’ve been eyeing with a 15% off Theragun discount code and 30% off other great deals.

Lenovo Coupon Codes and Deals: $5,000+ Off

Whether you’re shopping for a ThinkPad, Yoga laptop, or Legion gaming PC, these Lenovo discount codes and promotions can help you save big on your next tech upgrade.

Tuft & Needle Promo Codes: 20% Off | March 2026

Save 20% on best-selling mattresses with our top Tuft & Needle coupon codes.

Hulu Promo Codes & Discounts: 20% Off in March

Students can get a Hulu plan for $1.99 per month. Get more details on this and other great deals below.

Factor Promo Code: 50% Off Meal Prep

Make meal prep easier for any dietary need while enjoying great savings with our hand-picked Factor discount codes this March.

60% HP Discount Codes & Coupons March 2026

Save up to 60%, plus an extra 20% with HP promo codes for laptops, printers, PCs, and more tech.

20% Squarespace Promo Codes | March 2026

Get 20% off your next website, 10% off with an exclusive Squarespace discount code, 50% off plans, and more top coupons from WIRED.

LegalZoom Promo Code: Exclusive 10% Off LLC Formations

Save on top services at LegalZoom, like LLC registration, incorporation, estate plans, and more with coupons and deals from WIRED.

50% Off Verizon Promo Codes | March 2026

Save with Verizon coupon codes for $1,100 off Galaxy S25 phones, free iPhone 17 Pros, and up to 50% off plans.

Show HN: Robust LLM Extractor for Websites in TypeScript

We've been building data pipelines that scrape websites and extract structured data for a while now. If you've done this, you know the drill: you write CSS selectors, the site changes its layout, everything breaks at 2am, and you spend your morning rewriting parsers.

LLMs seemed like the obvious fix — just throw the HTML at GPT and ask for JSON. Except in practice, it's more painful than that:

- Raw HTML is full of nav bars, footers, and tracking junk that eats your token budget. A typical product page is 80% noise.
- LLMs return malformed JSON more often than you'd expect, especially with nested arrays and complex schemas. One bad bracket and your pipeline crashes.
- Relative URLs, markdown-escaped links, tracking parameters: the "small" URL issues compound fast when you're processing thousands of pages.
- You end up writing the same boilerplate: HTML cleanup → markdown conversion → LLM call → JSON parsing → error recovery → schema validation. Over and over.
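
To make the URL issues in that list concrete, here is a minimal sketch of the kind of cleanup involved. It is not taken from the library: the `cleanUrl` name, the `TRACKING_PARAMS` list, and the example shop URLs are all assumptions for illustration, and it relies only on the standard `URL` class available in Node.js and browsers.

```typescript
// Hypothetical list of tracking parameters; extend it for your own sources.
const TRACKING_PARAMS = new Set(["fbclid", "gclid", "mc_eid"]);

// Resolve a possibly relative, possibly markdown-escaped href against the page
// it came from, and drop tracking parameters. Returns null for unusable links.
function cleanUrl(rawHref: string, pageUrl: string): string | null {
  // Undo markdown escaping such as "\_" or "\(" that can leak into links.
  const unescaped = rawHref.trim().replace(/\\([_()\[\]*~&])/g, "$1");

  let url: URL;
  try {
    // The URL constructor resolves relative hrefs against the base page URL.
    url = new URL(unescaped, pageUrl);
  } catch {
    return null; // not parseable as a URL
  }

  // Keep only web links; skip mailto:, javascript:, and similar schemes.
  if (url.protocol !== "http:" && url.protocol !== "https:") return null;

  // Copy the keys first so we do not delete while iterating.
  for (const key of Array.from(url.searchParams.keys())) {
    if (key.startsWith("utm_") || TRACKING_PARAMS.has(key)) {
      url.searchParams.delete(key);
    }
  }
  return url.toString();
}

// Usage with made-up inputs:
console.log(cleanUrl("/products/blue\\_chair?utm_source=mail&id=7", "https://shop.example.com/catalog/"));
// prints "https://shop.example.com/products/blue_chair?id=7"
console.log(cleanUrl("mailto:sales@example.com", "https://shop.example.com/catalog/"));
// prints null
```

Returning `null` instead of throwing lets a pipeline skip a bad link and keep the rest of the page, which matters when processing thousands of pages.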

We got tired of rebuilding this stack for every project, so we extracted it into a library.

Lightfeed Extractor is a TypeScript library that handles the full pipeline from raw HTML to validated, structured data:

- Converts HTML to LLM-ready markdown with main content extraction (strips nav, headers, footers), optional image inclusion, and URL cleaning
- Works with any LangChain-compatible LLM (OpenAI, Gemini, Claude, Ollama, etc.)
- Uses Zod schemas for type-safe extraction with real validation
- Recovers partial data from malformed LLM output instead of failing entirely: if 19 out of 20 products parsed correctly, you get those 19
- Built-in browser automation via Playwright (local, serverless, or remote) with anti-bot patches
- Pairs with our browser agent (@lightfeed/browser-agent) for AI-driven page navigation before extraction
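
The partial-recovery idea in that list can be sketched roughly as follows. This is an illustration of the general technique, not the library's implementation: `Product`, `isProduct`, and `recoverItems` are hypothetical names, and the hand-written type guard stands in for a real schema check (with Zod, a schema's `safeParse` would play that role).

```typescript
interface Product {
  name: string;
  price: number;
}

// Stand-in for schema validation; assumed shape, not the library's schema.
function isProduct(value: unknown): value is Product {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.name === "string" &&
    typeof v.price === "number" &&
    Number.isFinite(v.price)
  );
}

// Validate each array element on its own, so one bad item does not
// discard the items that parsed correctly.
function recoverItems<T>(
  rawJson: string,
  isValid: (v: unknown) => v is T
): { valid: T[]; rejected: unknown[] } {
  const valid: T[] = [];
  const rejected: unknown[] = [];

  let parsed: unknown;
  try {
    parsed = JSON.parse(rawJson);
  } catch {
    // Syntactically broken output is not repaired in this sketch.
    return { valid, rejected: [rawJson] };
  }
  if (!Array.isArray(parsed)) return { valid, rejected: [parsed] };

  for (const item of parsed) {
    if (isValid(item)) valid.push(item);
    else rejected.push(item);
  }
  return { valid, rejected };
}

// Usage with made-up LLM output: the second item has a non-numeric price.
const llmOutput =
  '[{"name":"Desk","price":120},{"name":"Lamp","price":"n/a"},{"name":"Chair","price":45}]';
const result = recoverItems(llmOutput, isProduct);
console.log(result.valid.length, result.rejected.length); // prints "2 1"
```

Keeping the rejected items alongside the valid ones makes it possible to log or retry only the failures instead of re-running the whole extraction.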

We use this ourselves in production at Lightfeed, and it's been solid enough that we decided to open-source it.

GitHub: https://github.com/lightfeed/extractor
npm: npm install @lightfeed/extractor
Apache 2.0 licensed.

Happy to answer questions or hear feedback.


Comments URL: https://news.ycombinator.com/item?id=47526486

Points: 24

# Comments: 12

Categories

General: gadgets, software, security, AI, startups