This is super interesting to me. Is there anything else you can share about how you approached this? In my experience scraping Google, I have found roughly the same thing: once you've passed the captcha test, you can scrape a lot more.
Were you scraping with real browsers or something like Mechanize/Curl? Rate limiting at all? Proxies or real servers?
We had a pool of IPs we were leasing on our own; proxy services get abused and end up poisoned.
We didn't rate limit; we would just increase the size of the cookie pool if a captcha was hit. That was rare, because we would only scrape n pages per session, stopping at a threshold so the session wouldn't get captcha'd and we wouldn't have to solve one. We had two pools, the primary pool and the "chilling" pool: cookies near the end of their captcha life would cool off for a few hours before returning to the active pool, which behaves just like any other resource pool. Every page scraped would "borrow" a cookie out of the pool, customize the encrypted location key, and make the request with a common user-agent string.
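A minimal sketch of that two-pool scheme, as I understand it from the description above. The class name, thresholds, and cool-off window are all illustrative assumptions, not the actual implementation (which was in Erlang):

```python
import time
from collections import deque

COOL_OFF_SECONDS = 4 * 3600   # assumed cool-off window ("a few hours")
PAGE_THRESHOLD = 50           # assumed pages-per-cookie limit before cooling

class CookiePool:
    def __init__(self, cookies):
        self.active = deque(cookies)          # ready-to-use cookies
        self.chilling = []                    # (cookie, time it entered cool-off)
        self.pages = {c: 0 for c in cookies}  # pages scraped per cookie

    def borrow(self, now=None):
        now = time.time() if now is None else now
        # Return rested cookies to the active pool first.
        self.chilling, rested = [], self.chilling
        for cookie, since in rested:
            if now - since >= COOL_OFF_SECONDS:
                self.pages[cookie] = 0
                self.active.append(cookie)
            else:
                self.chilling.append((cookie, since))
        cookie = self.active.popleft()
        self.pages[cookie] += 1
        if self.pages[cookie] >= PAGE_THRESHOLD:
            self.chilling.append((cookie, now))  # near captcha life: cool off
        else:
            self.active.append(cookie)           # rotate to the back
        return cookie
```

Each scrape borrows a cookie, and a cookie that hits its page threshold sits out until its cool-off window has elapsed.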
Scaling it was difficult, but Erlang was invaluable to us once we had it figured out, and our dependence on IPs dropped once we worked out the cookie methodology.
When you set the location in Google, it customizes the cookie with a named field (a capital L, I believe). That field is encoded or encrypted and I could never figure it out, so I constructed a rainbow table: I used PhantomJS to set a location and scrape the cookie out, pairing the known location value with the encrypted value so that we could customize "the location of the search".
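The lookup side of that "rainbow table" is just a pair of dictionaries. The opaque strings below are made-up placeholders, not real cookie values, and the field name is assumed:

```python
# Pair each known location with the opaque cookie value observed after
# setting that location in a real (headless) browser session.
# NOTE: the encrypted blobs here are fabricated placeholders.
captured_pairs = [
    ("Austin, TX", "w+CAIQICIKQXVzdGluLFRY"),
    ("Boston, MA", "w+CAIQICIKQm9zdG9uLE1B"),
]

# Forward table: location -> opaque value (used when forging the cookie field).
location_to_blob = {loc: blob for loc, blob in captured_pairs}
# Reverse table: opaque value -> location (useful for auditing scraped cookies).
blob_to_location = {blob: loc for loc, blob in captured_pairs}

def cookie_field_for(location):
    """Return the opaque location field for a known location, or None."""
    return location_to_blob.get(location)
```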
Isn't this just promoting the concept of a well-written wiki? Linking documents together IS a graph, whether you store it in a graph DB or not. Graph DBs shine when it comes to graph queries; storing documents that are well linked can be handled by any old relational DB.
One advantage of a graph database implementation over a traditional wiki implementation is querying along edges as opposed to following links. Which is to say that the edges of a graph database tend to have richer semantics than a wiki's generic hyperlinks [but that is not to say a graph database is necessarily better].
To put it another way: at best, a link on a wiki page gets implied importance from its position within a document. A graph database edge can have many properties, including "weight" or "relevance". Of course these could be added to wiki markup as properties, but there is a point at which the right tool is better than layered make-dos.
If desired, one can even make this easier to parse programmatically:
<a href="foo.html" class="ExampleCode">Hello world example</a>
<a href="bar.html" class="GrammarNotes">Grammar</a>
That allows for styling different types of links differently. Also, a wiki's backend software could extract separate indexes for sample code, grammar fragments, links to compiler source, etc. You could also easily include an attribute for, e.g., click-through rate and, if desired, style links accordingly.
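As a sketch of that backend-indexing idea, one can group links by their class attribute with the standard library's HTML parser. The class names follow the example above:

```python
from html.parser import HTMLParser
from collections import defaultdict

class LinkIndexer(HTMLParser):
    """Collect hrefs into one index per link class (ExampleCode, etc.)."""
    def __init__(self):
        super().__init__()
        self.indexes = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attrs = dict(attrs)
            if "href" in attrs and "class" in attrs:
                self.indexes[attrs["class"]].append(attrs["href"])

indexer = LinkIndexer()
indexer.feed('<a href="foo.html" class="ExampleCode">Hello world example</a>'
             '<a href="bar.html" class="GrammarNotes">Grammar</a>')
```

After feeding a page, `indexer.indexes` holds a separate href list per link type.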
In fact, the difference between a classical wiki and a graph database is, IMO, an implementation choice. If you want to process complex queries rapidly, or a lot of your attributes aren't textual or aren't intended primarily for visual display, a graph database is more appropriate. If you want to serve standard web pages rapidly, storing the content as HTML may be more appropriate, even if that means queries such as "give me all links to source code examples" run much more slowly.
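To make the edge-query side concrete, here is a minimal sketch in which edges carry a type and a weight, so "give me all links to source code examples" becomes a filter over edge properties rather than a crawl of anchor tags. Page names and weights are illustrative:

```python
# (source page, target page, {edge properties})
edges = [
    ("Parsing", "foo.html", {"type": "ExampleCode",  "weight": 0.9}),
    ("Parsing", "bar.html", {"type": "GrammarNotes", "weight": 0.4}),
    ("Lexing",  "baz.html", {"type": "ExampleCode",  "weight": 0.7}),
]

def links_of_type(edges, edge_type):
    """All (source, target) pairs whose edge carries the given type."""
    return [(src, dst) for src, dst, props in edges
            if props["type"] == edge_type]
```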
Could you clarify: What are the nodes and edges? May the graph contain cycles? May the graph be infinite? If the graph is infinite, do we need a distinguished initial node? Do non-isomorphic graphs always correspond to different computations, or do you have a coarser notion of equivalence (say, graph homeomorphism)?
I've read papers where the space of possible computations for a given (typically, toy, even loop-free) nondeterministic concurrent program is represented as a directed topological space (see “directed algebraic topology” for more information), but these spaces are more general than directed graphs.
I don't mean directed graphs, only vertices and edges.
I don't think you would need to deal with infinite graphs; "arbitrarily large" should be able to cover all (finite) possible computations and/or representations.
My guess is that every graph may correspond to an infinite number of possible computations and/or representations (though not necessarily every one of them), depending on which "correspondence function" is used to analyze the graph (and there may be additional "correspondences" between those functions and/or their results).
Broadly speaking, in graph reduction the nodes are functions and values, and the edges are the steps that can be taken in the computation to apply functions to those values. After application, that node becomes a new value, and so on until the program is evaluated or terminated.
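A toy sketch of that idea: application nodes are rewritten in place into values until no reducible node remains. The nested-tuple representation is my own illustration, not a real graph-reduction engine:

```python
import operator

def reduce_step(node):
    """Rewrite one application node into a value, innermost first.

    A node is either a plain value, or a tuple (fn, left, right)
    representing fn applied to two sub-graphs.
    """
    if not isinstance(node, tuple):
        return node, False                      # already a value
    fn, left, right = node
    left, changed = reduce_step(left)
    if changed:
        return (fn, left, right), True
    right, changed = reduce_step(right)
    if changed:
        return (fn, left, right), True
    return fn(left, right), True                # both args are values: apply

def evaluate(node):
    """Reduce step by step until the graph is a single value."""
    changed = True
    while changed:
        node, changed = reduce_step(node)
    return node
```

For example, `evaluate((operator.add, (operator.mul, 2, 3), 4))` first rewrites the multiplication node into `6`, then the addition node into `10`.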
I don't think there's any reason to be judgy about the other commenter. I understand you wanted a rigorous definition but there are better ways to ask for it than making unilateral demands and using negative terms like "disappointing" and "annoying". You may be well versed in these subjects but it is possible to be technically correct and gracious at the same time. I think you're capable of doing so and you would probably find more fruitful discussions by taking that path.
What did you not find useful about my numerous answers?
Basically, everything is in some way equivalent to some graph. Since anything is an example, it's difficult to say much without using an example, which then necessarily constrains the discussion.
It would be more useful if you presented an example computation or representation, then I could show you how to make and reverse an equivalent graph. (Which then might show you what information might be associated with nodes and edges.)
We'd need to be talking about a concrete example to say with much specificity, but a node is basically a "thing" and an edge is a binary relationship between two things. A thing may have relationships with itself.
The precise "information" associated with them could be "anything", but would be prescribed by the computation and/or representation being discussed and the graph in question.
How disappointing. I was expecting something more concrete, like “every node is a sequence point and every edge is a possible transition” or “every node is a point at which a nondeterministic event occurs and every edge corresponds to a causal relation between events”. But if you just say “a computation is a graph” and give no further details, then there's no actual benefit to modeling computations as graphs.
I'm even more annoyed by your use of the word “isomorphic”, which, FYI, doesn't mean “vaguely similar in some way I can't articulate”. It only makes sense to speak of two mathematical objects being “isomorphic” when they belong to the same category. What category do you have in mind, that includes both computations and graphs as objects, and what are the morphisms between them?
Disappointing? You are the one coming to me expecting an example-oracle that produces them on demand.
Computation / representation:
2 + 2
There is a way to produce the graph from the computation / representation and vice versa. There is a way to do that for every possible computation / representation and every possible graph (though, not necessarily with every pair).
This could be described as a compiler or compiler-like. There are many interesting things to do with them, but in this example all that is needed to consider is the construction of an Abstract Syntax Tree from a string and an in-order tree walker that emits the contents of the nodes to an output string.
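A sketch of that round trip: build an AST for "2 + 2" and recover the string with an in-order walk. The parser is deliberately minimal (a single binary operator with space-separated tokens), which is an assumption of mine for this example:

```python
class Node:
    """An AST node: an operator or operand, with optional children."""
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def parse(text):
    """'2 + 2' -> AST with '+' at the root and the operands as leaves."""
    left, op, right = text.split()
    return Node(op, Node(left), Node(right))

def emit(node):
    """In-order tree walk that emits node contents back to a string."""
    if node is None:
        return ""
    parts = [emit(node.left), node.value, emit(node.right)]
    return " ".join(p for p in parts if p)
```

Here `emit(parse("2 + 2"))` reproduces the original string, which is the "and vice versa" direction.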
(0) You said “I don't mean directed graphs, only vertices and edges.” But a syntax tree, viewed as a graph, is very much a directed one: the relation between a parent node and its children is asymmetric.
(1) A syntax tree isn't the same thing as a computation. Of course, you can produce computations by interpreting syntax trees, but: (a) It's perfectly possible for two distinct syntax trees to produce the same computation. [Say, by renaming all variables.] (b) It's also perfectly possible that interpreting the same syntax tree twice will produce completely different computations! [Say, if your language isn't pure.] Unsurprisingly, a syntax tree is a representation of syntax, not computation.
(2) Yes, I'm disappointed, because I expected your observation that “every computation (...) is isomorphic[sic] to some graph” to provide more insight than it turned out to.
Placing [sic] into a paraphrasing is just misleading.
0) You asked, "What information is associated to [nodes and edges]?". In this example, one piece of information is the direction of an edge.
1) I don't think I claimed they were the same. In fact, I am claiming that one can be represented by the other and vice versa. a) Yes, that is part of the reason I conjecture that each graph may be associated with an "infinite number" of computations and/or representations. b) Also a reason: there are perhaps an infinite number of languages, interpreters, compilers, etc. (some of which may produce the same result) for a given graph.
2) It was only an observation that the comment I responded to claimed only a subset of the truth. Anything done on a computer can be considered as a graph.
The point isn't necessarily one graph representation, but simply that any problem can be looked at in a graphy manner if you try hard enough, so "hey, it's a graph" isn't an interesting insight. But to answer your question, data flow graphs are a general representation of function composition, and Turing machines can be looked at as graphs with labeled edges. Yes, we're using "isomorphism" in the pop-compsci, Gödel-Escher-Bach sense.
A TV network's product is television content, their monetization strategy is advertising. The delivery mechanism is cable / airwaves.
Sure, there are secondary economic effects and markets that complicate the picture. But the basic business model of a television network isn't all that different from that of a regular company that sells its products directly.
So AMC and HBO are making different things? No, of course not. They both produce content and distribute it through cable networks. One of them charges advertisers to deliver the product to viewers, while the other one charges the viewers directly.
To be fair, if you're an advertiser, then the product is eyeballs. But if you're a pair of eyeballs, then the product is the content. They're not mutually exclusive. Same with google. If you're a searcher, then the product is results. If you're an advertiser, then the product is relevant searchers.
I've seen your timeblock.com site a few times because of these threads, and every time I click on the link and end up back on your landing page I get so frustrated because your site does such an amazing job of not actually saying anything about the product. What is your product? What does it look like? How do I use it? How does it compare to other products? Why would I pay for it? What am I paying for?
I want the first "stable" release to be 3.0.0, which is lexicographically before 3.0.0-beta. Admittedly this is only a limitation of Sparkle, but I just haven't prioritized modifying Sparkle above fixing real bugs.
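The ordering clash is easy to demonstrate: plain lexicographic comparison puts "3.0.0" before "3.0.0-beta" (a prefix sorts first), while SemVer precedence says 3.0.0-beta is the earlier release. The `semver_key` below is a crude stand-in for real SemVer precedence, just enough for this pair:

```python
versions = ["3.0.0", "3.0.0-beta"]

# Lexicographic order: the stable release looks *older* than its beta.
lexicographic = sorted(versions)

def semver_key(v):
    """Crude SemVer-style key: for equal cores, pre-releases sort first.

    (Real SemVer precedence also compares numeric identifiers; this is
    only a sketch sufficient for this example.)
    """
    core, _, pre = v.partition("-")
    return (core, pre == "", pre)   # pre == "" is True for stable, sorting it last

semver_order = sorted(versions, key=semver_key)
```

This is why a lexicographic updater would see 3.0.0-beta as "newer" than the eventual 3.0.0 release.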
SemVer works great for libraries, but IMO it falls apart for end-user software. How would you define a "breaking change" in an application? Would it be a complete redesign? Or maybe a feature removal (does that ever happen?)
Versions are supposed to convey how much the app has changed between two releases. iTerm got soooooo many new features in this new version that anything smaller than a major bump would feel out of place.
I do agree the actual version should be v3.0.0-beta1 tho ^^.