
Show HN: Extractor API v1 – the text extraction workflow API and UI - copypirate
https://extractorapi.com/
======
copypirate
Hey HN - this is my first SaaS, built solo after years of working as a freelance
copywriter and content manager.

A few years back I discovered Python, and very quickly after, NLP. As a writer
with a love for sci-fi/tech, I was enamored, and spent ungodly hours in my
employer's GCP-hosted Jupyter notebooks, coming up with all sorts of
impractical experiments with spaCy, Facebook's StarSpace, Gensim, and the
like.

For one, I needed a lot of training data. I'd go crawl thousands of pages of
text from news sites, using Scrapy and storing data directly on the server.
For text extraction and boilerplate removal, I used newspaper3k, and
eventually a custom extractor that used a random forest model to select proper
element "candidates".

I wanted a simpler way to aggregate text for a dataset, query it, create
subsets based on keywords, and so on. The paid options out there - Diffbot,
Aylien, Ujeebu, Scrapinghub's news API, etc. - weren't exactly what I was
looking for.

After learning the minimum amount of JS required, I built a shitty local app
where you could paste a bunch of URLs and get back a JSON with the extracted
text. I posted it up here, on HN, and there were a few hundred visits,
absolutely demolishing the $5 DO instance. I figured others might want
something like this.

So I built extractorapi.com - a text extraction API and UI that revolves
around the idea of "jobs". For example, let's say you gathered a list of URLs
from the NY Times, or The Economist, or Bloomberg. You then provide that list
of URLs to a job called "my_articles". In Python, with requests:

    import requests

    api_key = "YOUR_API_KEY"
    endpoint = "https://extractorapi.com/api/v1/jobs"

    headers = {"Authorization": f"Bearer {api_key}"}
    data = {
        "job_name": "my_articles",
        "url_list": ["example.com/article1", "example.com/article2", ...]
    }

    # Send as JSON -- a form-encoded data= payload would mangle the nested URL list
    r = requests.post(endpoint, headers=headers, json=data)

This job then processes your input URLs server-side; once it's complete, you
can query the extracted text or titles within the job. All jobs and extracted
text are saved to your account - through the API or the web app, you can
explore the jobs you've started, download them in .csv or .json format,
and check their progress.
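Checking on a running job can be sketched like this - note that the per-job
endpoint path and the "status" field are my assumptions for illustration, not
confirmed API details; check the docs for the exact shapes:

```python
import requests

API_KEY = "YOUR_API_KEY"
JOBS_ENDPOINT = "https://extractorapi.com/api/v1/jobs"

def auth_headers(api_key):
    # Same Bearer-token scheme as the job-creation call above
    return {"Authorization": f"Bearer {api_key}"}

def get_job(job_id, api_key=API_KEY):
    # Hypothetical per-job endpoint (/jobs/<id>) -- the real path and
    # response fields may differ
    r = requests.get(f"{JOBS_ENDPOINT}/{job_id}", headers=auth_headers(api_key))
    r.raise_for_status()
    return r.json()

def is_complete(job):
    # Assumes the job payload carries a "status" field
    return job.get("status") == "complete"
```

From there you'd loop with a sleep until is_complete() is true, then pull the
extracted text or grab the .csv/.json export from the web app.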

I go into more detail in this Medium piece, "Creating an Automated Text
Extraction Workflow":
https://medium.com/@aleks_82234/creating-an-automated-text-extraction-workflow-part-1-6f2197d50749

I get that it's hard to market to devs like myself (why buy when you can
build?), so I'm looking for any feedback/criticism/suggestions on your
experience with Extractor API. As Faulkner put it, you must kill your
darlings - let me know if all this shit doesn't make any sense, or if it's
actually helpful. Or somewhere in between.

