Tidyverse code is not write-only; it is designed to be readable, with abstraction at the level of the domain: in this case, data frames and data pipelines, akin to the same constructs in relational algebra/SQL or in map/reduce-style data processing.
Correct usage requires learning the right abstractions, but fortunately these abstractions are shared across language communities and frameworks.
If you are doing any data processing work and you do not know foundational functional concepts, i.e., map/reduce, then I would argue the code you write is less readable: overly focused on the idiosyncrasies of how to do something rather than on how entities and data are related to each other, their logic and structure.
The main optimization is to be correct and expressive of one's intention and purpose. If you need performance, use Julia.
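The map/reduce style referred to above can be sketched in Python; the orders data and field names here are invented purely for illustration:

```python
from functools import reduce

# Hypothetical sample data: a list of order records.
orders = [
    {"id": 1, "status": "complete", "amount": 40.0},
    {"id": 2, "status": "pending",  "amount": 15.0},
    {"id": 3, "status": "complete", "amount": 60.0},
]

# Pipeline of transformations: filter -> map -> reduce.
completed = list(filter(lambda o: o["status"] == "complete", orders))
amounts = list(map(lambda o: o["amount"], completed))
total = reduce(lambda acc, x: acc + x, amounts, 0.0)
average = total / len(amounts)

print(average)  # 50.0
```

Each step names a relationship in the data (which rows qualify, which field matters, how values combine) rather than the mechanics of loop indices.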
To be fair, it's very hard to write expressive, readable code that munges some horrible tabular format into another arbitrary tabular format. That said, it's true that most R code doesn't look like map/reduce, or any other pipeline of transformations. It's usually more like someone cut and pasted a long, hard REPL session into a notebook and has no intention of ever scrolling back up.
I understand that's the goal, but it hasn't been what I've seen from users' code. I think the issue, which the article describes, is that the tidyverse has too large a surface area for something aimed at very smart people who are not software engineers and don't have time to become one.
So they end up with a subset of functionality they can get stuff done with, but then they need to collaborate with someone who uses a different subset. That can be messy.
I think in an environment where everyone learns R from the same course/book/MOOC/whatever (or has done lots of different sorts of programming) and the organization can impose a style guide, the tidyverse approach would be great; but when you have people coming from all sorts of places and backgrounds, I don't think it's a good fit.
I concur with your sentiments, having cultivated data science teams from the ground up with diverse educational backgrounds.
Programming in base R is more akin to assembly language and has accreted a babel of inconsistencies that make it difficult to teach and learn. Learning base R isolates you on a Galápagos island of academics who are either ignorant of the needs of data workers or too elitist to engage with those outside their priesthood.
Learning Tidyverse is a considerably better transition for learning other languages, frameworks, and libraries.
Functional programming is closer to algebra than indexing into data structures with magic numbers. I've found more success teaching functional pipelines of data structures using the idioms in Tidyverse as a general framework for data work than base R. Abstraction has a cost but for learning it is the appropriate cost.
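As a rough illustration of "algebra vs. magic numbers" (sketched in Python with invented sample data, since one language keeps the examples consistent):

```python
from typing import NamedTuple

# Invented sample data: raw rows as positional lists.
rows = [["alice", 30, 55000.0], ["bob", 41, 62000.0]]

# Magic-number indexing: the reader must memorize that column 1 is
# age and column 2 is salary.
opaque = sum(r[2] for r in rows if r[1] > 35)

# Named fields plus a functional pipeline express the same logic
# at the level of the domain.
class Person(NamedTuple):
    name: str
    age: int
    salary: float

people = [Person(*r) for r in rows]
clear = sum(p.salary for p in people if p.age > 35)

assert opaque == clear == 62000.0
```

Both compute the same number, but only the second tells you *what* is being computed without out-of-band knowledge of the column layout.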
I sense that much of this "monopolistic" fear mongering is really about feeling out of date.
The Chinese government has prevented plenty of US companies, including Google, from doing business in China; but more important is the Chinese government's rejection of patents and intellectual property, and the espionage it conducts, which has cost US companies billions and the US government hundreds of lives.
> which has cost US companies billions and the US government hundreds of lives.
Can you tell me with a straight face that the CIA, NSA etc. do _not_ have active espionage activities in China? Do you think valuable intellectual property and state secrets exist only in the US? That as a result of these activities, the Chinese Government has not also lost lives?
Do not buy so easily into propaganda. Nations throughout the world will promote themselves to their own people as the only "good" nation. Because the US is a democracy, the US government has to be very careful and frame the narrative in a way that's not an outright power grab.
This is the best justification for R: its statistical lineage and adoption by researchers. There's plenty of great thinking manifested in these libraries and their community. This can't be overstated.
R is quite good at developing DSLs (quite Lispy), but Julia is even Lispier, with a more flexible syntax and unicode identifiers. R definitely has a great statistical lineage, but that is unfortunately also what limits its expressiveness, especially regarding new data types. No one in R creates their own domain-specific data types (that are also performant), whereas this is the norm in the Julia ecosystem.
Speed is not the only computational constraint: the tremendous memory required for very large data sets is often a severe limitation, especially within R (and I actually enjoy R).
This is Julia's native, in-language approach (no delegating to C/C++):
https://juliadb.org/
The irony is that if you use Patreon's search functionality to search for slurs and other hate speech you will find plenty more blatant and obvious instances on their own platform.
I'm no big fan of excel, but the modern xlsx format is "open", in that it's documented (though maybe not perfectly). It certainly is possible to read data from it using a variety of open source packages, from LO Calc to libraries like pyxl.
It's not that it's open; it's that all the steps are visible and can be repeated. Excel is just manual labor: you can't see how the person took the raw data and arrived at their conclusion.
The reals are not equivalently unrealizable or dubious compared to infinitesimals. Infinitesimals can be constructively specified as nilpotent/nilsquare entities: numerical entities which, when squared, equal 0, but which are not themselves reducible to 0. All of this can be done in a constructive manner, avoiding any use of infinity or of classical logic that depends on indirect proofs (i.e., the excluded middle). John Bell's 'A Primer of Infinitesimal Analysis' has good details.
The computational techniques of automatic differentiation use these kinds of entities to calculate derivatives exactly, without approximating infinite (limiting) processes.
Calculus can be done constructively and purely algebraically, without infinite limiting processes, using these nilpotents. And under a geometric interpretation there is nothing nonsensical about a tangent line to a curve.
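A minimal Python sketch of how automatic differentiation exploits a nilsquare entity: dual numbers of the form a + b·ε with ε² = 0. The class and function names here are my own, not from any particular AD library:

```python
class Dual:
    """A dual number a + b*eps, where eps**2 == 0 (nilsquare)."""
    def __init__(self, real, eps=0.0):
        self.real, self.eps = real, eps

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.real + other.real, self.eps + other.eps)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # (a + b*eps)(c + d*eps) = ac + (ad + bc)*eps, since eps**2 = 0.
        return Dual(self.real * other.real,
                    self.real * other.eps + self.eps * other.real)
    __rmul__ = __mul__

def derivative(f, x):
    # Evaluate f at x + 1*eps; the eps coefficient is f'(x), exactly.
    return f(Dual(x, 1.0)).eps

# f(x) = 3x^2 + 2x, so f'(x) = 6x + 2 and f'(5) = 32: no limits taken.
print(derivative(lambda x: 3 * x * x + 2 * x, 5.0))  # 32.0
```

The derivative falls out of the algebra alone, with no limiting process and no numerical approximation, which is the constructive point being made above.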
Also you don't differentiate numbers (real or rational), you can only differentiate functions.
Also, the idea of an actual infinity is a poetic mathematical one. It doesn't have to fit reality. The issue is whether it is useful, and to what extent.
Seems I agree with you (or you agree with me) :)
I don't have a problem with calculus; otherwise I wouldn't be able to do physics. My problem is with calculus based on infinity.
> Also the idea of an actual infinity is a poetic mathematical one. It doesn't have to fit reality. The issue is whether it is useful and to what extent.
Well said.