

Web scraping with Factor - otoburb
http://re-factor.blogspot.com/2014/04/scraping-re-factor.html

======
techaddict009
I do regular web scraping using PHP and curl. This seems more interesting;
time to learn the Factor programming language.

@otoburb can you recommend any better guide for the same?

~~~
otoburb
The Factor documentation is hosted at
[http://docs.factorcode.org](http://docs.factorcode.org). Feel free to browse!
It's the same set of docs that you also have access to locally when you
download the Factor binaries[1] or pull/compile from GitHub[2].

The concatenative.org[3] wiki also has similar starting material and pointers.

Factor is a fun language. The blog that I linked to is written by one of the
Factor contributors.

[1] [http://factorcode.org/](http://factorcode.org/)

[2]
[https://github.com/slavapestov/factor](https://github.com/slavapestov/factor)

[3]
[http://concatenative.org/wiki/view/Factor/Learning](http://concatenative.org/wiki/view/Factor/Learning)

~~~
klibertp
Do you know about any "Factor for Forth programmers" tutorial? Factor is
similar enough to irritate me with its "for beginners" materials, but
different (and larger) enough to make normal manuals useless by themselves.

I'd especially appreciate a bottom-up write-up, starting with the stack and
cells (which feel familiar), and then introducing higher-level abstractions of
Factor.

~~~
bjourne
Better wash your brain about that implied misconception. Factor is much more
like Lisp and Haskell (esp point-free style) than Forth. I guess "learn Lisp
and/or Haskell and Factor won't seem so foreign" isn't terrific advice. But
right now there aren't many newbie guides at all.
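
For readers who have not met point-free style, here is a rough sketch in Python rather than Factor or Haskell. The `compose` helper and the `double` / `increment` functions are made up for illustration; the only point is that a pipeline gets defined without ever naming its argument, and that joining two pipelines is just composition again.

```python
from functools import reduce

def compose(*funcs):
    """Left-to-right composition: compose(f, g)(x) == g(f(x))."""
    return lambda x: reduce(lambda acc, f: f(acc), funcs, x)

def double(n):
    return n * 2

def increment(n):
    return n + 1

# Point-free: the pipeline is built without mentioning its argument.
double_then_increment = compose(double, increment)

# Concatenating two pipelines is just composing them once more.
pipeline = compose(double_then_increment, double_then_increment)
```

Calling `double_then_increment(5)` doubles first and increments second; `pipeline` runs that sequence twice in a row.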

~~~
klibertp
I dunno, syntactically it resembles Forth quite a bit, what with : ; for
defining words and () for comments and all that. Anyway, I have no problem
whatsoever with high-level abstractions in Factor, nor with its concatenative
nature, nor with its macros and so on. I know all these features from other
languages. What I want is just a description of how these high-level things
map to assembly, I guess. For example, I just learned that: "Internally, a
quotation is a pair, consisting of an array and a machine code entry point.
The array stores the quotation's elements" - this is the kind of definition I
want for all the abstractions in Factor. It's probably best to go through
Slava Pestov's blog and pick up such scattered descriptions, but I'd really
appreciate it if someone prepared a single article with all these definitions.
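
To make the quoted definition concrete, here is a toy model in Python. It is not how Factor's VM is implemented: a Python closure stands in for the machine code entry point, and the `Quotation` class, `compile_quotation`, and the compile-on-first-call behavior are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Stack = List[object]

@dataclass
class Quotation:
    """Toy model of a quotation as a pair: (array of elements, entry point)."""
    elements: list                                          # the quotation's elements
    entry_point: Optional[Callable[[Stack], None]] = None   # stand-in for machine code

    def call(self, stack: Stack) -> None:
        # Assumption of this sketch: build the entry point on first call, then reuse it.
        if self.entry_point is None:
            self.entry_point = compile_quotation(self.elements)
        self.entry_point(stack)

def compile_quotation(elements: list) -> Callable[[Stack], None]:
    """Hypothetical 'compiler': turns the element array into one closure."""
    def run(stack: Stack) -> None:
        for element in elements:
            if callable(element):
                element(stack)          # a word: operates on the stack
            else:
                stack.append(element)   # a literal: pushed onto the stack
    return run

def add(stack: Stack) -> None:
    """Illustrative word: pops two values and pushes their sum."""
    b, a = stack.pop(), stack.pop()
    stack.append(a + b)
```

With this model, `Quotation([1, 2, add])` keeps its element array around for inspection, while `call` only ever goes through the entry point.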

~~~
csandreasen
Note that the stack comments aren't actually comments in Factor - they're part
of the function definition and are mandatory. The compiler will do a simple
check to ensure that all of your stack inputs and outputs match up for each
function call.
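
A minimal sketch of that kind of check, written in Python and not taken from Factor's actual stack checker. The `EFFECTS` table, `infer_effect`, and `check_definition` are hypothetical names, and effects are reduced to plain input/output counts.

```python
# Declared stack effects: word -> (values consumed, values produced).
# Illustrative entries only, not Factor's real vocabulary metadata.
EFFECTS = {
    "dup": (1, 2),
    "drop": (1, 0),
    "swap": (2, 2),
    "+": (2, 1),
    "*": (2, 1),
}

def infer_effect(body):
    """Infer (inputs, outputs) for a sequence of words and integer literals."""
    needed = 0   # values the body requires from its caller
    height = 0   # values currently on the stack, caller-supplied ones included
    for token in body:
        if isinstance(token, int):
            ins, outs = 0, 1              # a literal pushes one value
        else:
            ins, outs = EFFECTS[token]
        if height < ins:
            needed += ins - height        # the shortfall must come from the caller
            height = ins
        height += outs - ins
    return needed, height

def check_definition(declared, body):
    """Raise ValueError if the declared effect does not match the inferred one."""
    inferred = infer_effect(body)
    if inferred != tuple(declared):
        raise ValueError(f"declared {tuple(declared)} but inferred {inferred}")
    return inferred
```

For example, a squaring word with body `["dup", "*"]` passes when declared as `(1, 1)` and is rejected when declared as `(2, 1)`.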

~~~
mrjbq7
We made stack effects mandatory for most definitions, as it appeared to be an
area of frustration for new Factor programmers.

However, we have a stack checker that still supports optional stack effects if
you yearn for the good ol' days:

[https://github.com/slavapestov/factor/issues/887](https://github.com/slavapestov/factor/issues/887)

------
naveen99
Factor has OpenGL built into its REPL, so it kind of gives you the mythical
modern terminal from one of Gary Bernhardt's talks.

~~~
agumonkey
I forgot about that ... Factor impresses me a lot, for there's so much packed
into it. A one-man effort, it seems ... even more impressive.

~~~
mrjbq7
Although it feels like a one-programmer effort, there have been a few of us
contributing consistently over the years:

[https://github.com/slavapestov/factor/graphs/contributors](https://github.com/slavapestov/factor/graphs/contributors)

~~~
agumonkey
Thanks for showing me this, I was greatly misguided.

------
riffraff
Somewhat related to Factor: whatever happened to Slava Pestov?

I used to follow his blog about Factor with interest some years ago and then
all of a sudden... he was gone.

~~~
mrjbq7
He is working for Google. From the outside, it looks a lot like he disappeared
from open source...

------
danso
This didn't seem interesting to me, because there are so many ways to scrape a
site, in every high-level language, that learning one more way to do it is
going to bring diminishing returns...and I also didn't know that "Factor" was
a language.

The OP's template could have some more info on what Factor is, but there are a
few links, including this wiki for it:
[http://concatenative.org/wiki/view/Factor/Learning](http://concatenative.org/wiki/view/Factor/Learning)

I honestly didn't know what a "concatenative" language was until I saw that
"Six programming paradigms that will change how you think about coding" post
that fronted HN last week:

[http://brikis98.blogspot.in/2014/04/six-programming-paradigm...](http://brikis98.blogspot.in/2014/04/six-programming-paradigms-that-will.html)

So this is just a long way of saying...after just learning about concatenative
languages, I'm really interested in what that paradigm brings to a common
task...maybe there aren't productivity gains, but I love learning different
philosophies of coding, and thanks to the OP for showing one practical
example.

~~~
mrjbq7
I added some information about the Factor language to the blog template -
great idea, thanks.

Concatenative languages are quite interesting, and I'd encourage you to try
one out. You might find it helps your thinking about certain problems.

------
notastartup
You can write an entire web scraper with just a URL using
[http://scrape.ly](http://scrape.ly)

With scrape.ly I can just do this to crawl the entire Hacker News site across
pages, grab the URLs, and extract any data from the page it lands on without
defining any fields (it discovers them on its own), so it doesn't require you
to 'relabel' fields when the site changes layouts. It also generates new IP
addresses on the fly so you don't get stuck, and launches multiple threads for
you to speed up the process. It works fully with AJAX sites and single-page
apps. Flash support is coming too.

    
    
        http://scrape.ly/s/{https://news.ycombinator.com/}
        {next:More}{Space Monkey dumps Python for Go}*{fields:'Auto'}
    

Honest question (I don't mind downvotes if you disagree), but why would you
want to waste time writing web scrapers, keeping them running, and fixing the
code? Multiply it by 100 or 1000 different websites and it becomes a full-time
job. For me, I want to get the data I need with the least possible overhead
and as soon as possible, and I don't really want to be bothered with setting up
environments and hosting for it to run, or with fixing bugs when sites change
layout.

~~~
Avshalom
This is not a post about web scraping. It's a post about doing something in
Factor.

~~~
notastartup

        Web scraping with Factor

~~~
bunderbunder
Are you familiar with the idea of implementing common problems for the sake of
pedagogy? For example, someone who might want to demonstrate how a particular
programming language can be used might start a blog, and in that blog said
person might post articles demonstrating how you could attack a particular
problem in that language.

Your criticism of this post comes across as tone-deaf. You might as well have
written the editors of _Beautiful Code_ to lecture them about how the chapter
on quicksort is horribly misguided and that everything a good software
craftsman should ever care to know on the subject can be found at
[http://docs.oracle.com/javase/7/docs/api/java/util/Arrays.ht...](http://docs.oracle.com/javase/7/docs/api/java/util/Arrays.html#sort\(int\[\]))

~~~
notastartup
Honestly, I meant no harm. I saw that we were talking about web scraping in
other languages like PHP and Python, and I wanted to add to the idea above
that Factor doesn't really provide more value than an implementation of the
job in any other language would. They all share the same overhead associated
with web scraping, which rests on the shoulders of the developer. All in all,
I wanted to highlight that one shouldn't put so much effort into creating web
scrapers, and suggested a different tool that is specialized for the same job
mentioned in the article.

