professionalguy's comments

professionalguy · on Sept 29, 2022

I was also looking for photos or diagrams of the road. So weird they didn’t include any.

professionalguy · on July 2, 2021

That’s cool and everything, but I don’t know many DS people who still use R. Maybe academics still do?

iafiaf · on July 2, 2021

R is overwhelmingly used in bioinformatics. There is nothing quite like Bioconductor. Most new tools/methods (for example, in scRNA-seq) release R packages first.

asdff · on July 2, 2021

Well, I'd say conda is quite like Bioconductor in how easy it makes installing relevant packages. scRNA-seq has popular R packages like Seurat but also popular Python packages like scanpy.

fatboy93 · on July 4, 2021

I didn't understand your comment, which is probably my fault.

But you can absolutely install many bioconductor packages from conda.

I love using conda as my environment manager rather than compiling and installing 1000 different libraries and tools.

Also, I install mamba for drastically faster resolution of the dependencies.

cardosof · on July 2, 2021

Really depends on the application. For clean, concise and reproducible ad hoc statistical analytics and modelling, there isn't a better tool than tidyverse+tidymodels.

It's a classic case of the best tool for the job. I usually create simple stuff in R and then move to bigger datasets and production in py+spark.

mellavora · on July 2, 2021

tidyverse may be clean, but it is nowhere near as concise as data.table.

data.table is also typically orders of magnitude faster.

cardosof · on July 2, 2021

Thanks for the point of view, I can't argue since I don't really know data.table. Will check out!

stanbiryukov · on July 3, 2021

Check out dtplyr: a lazy data.table backend with tidyverse syntax.

listenallyall · on July 2, 2021

Although I agree and don't like R very much, I believe ggplot is still the gold standard for creating top-quality visualizations. None of the python (or other language) clones are quite as good. For projects where the end goal is a complex or detailed graph or plot, it's sometimes worth trudging through R to achieve the best final result.

tonyarkles · on July 2, 2021

Yup, not a data scientist but often do data processing to analyze the outcome of experiments (drone-flight related). I'll use Python/Jupyter if there's a significant amount of clean-up that needs to happen, but R/ggplot is unbeatable if I'm trying to look at the data from different perspectives. As an example, I was trying to look at GPS data the other day and ggplot() + geom_point() + geom_density_2d() was an absolutely perfect way to better grok what was going on.

otabdeveloper4 · on July 2, 2021

I've used Python since 1995, so I should be biased, but switching from Python to R is a huge productivity boost - like switching from Excel to Python. R is just years ahead.

rcthompson · on July 2, 2021

R sees significant use both in academic/research settings and industry.

Fomite · on July 2, 2021

It's the dominant language among academic statisticians.

wespiser_2018 · on July 2, 2021

There are a lot of DS folks using it for Bayesian Statistics

legobmw99 · on July 2, 2021

It’s sadly still quite popular in the research world

jazzyjackson · on July 2, 2021

What about it makes you sad?

vore · on July 2, 2021

Not the original poster, but the language has some really weird edges. For instance, check out this wild behavior: http://www.hep.by/gnu/r-patched/r-lang/R-lang_41.html

asdff · on July 2, 2021

This is because R borrows a lot of syntax from S. When R came out, statisticians were using S, so it was natural to make it like this. If they had gone another way, you'd have had statisticians on mailing lists 20 years ago bemoaning how unlike familiar S it was, rather than regular old programmers today bemoaning that R isn't like familiar Python, which is what happens on HN whenever there is an R thread.

vore · on July 2, 2021

I think the behavior is so wildly inconsistent that it's not really justifiable, regardless of being a statistician or not: https://github.com/tidyverse/design/issues/13#issuecomment-4...

asdff · on July 2, 2021

I mean compared to other languages these sorts of quirks might seem like big deals, but they rarely come up. You see that error, you copy paste and find a stack overflow thread explaining it, you know what to do next time and move on. R is certainly no C.

kgwgk · on July 2, 2021

As far as weird edges go, that one is really, really mild. It may even be considered a good idea!

For people interested in weirder things, check The R Inferno (I think it's somewhat outdated by now, though):

https://www.burns-stat.com/documents/books/the-r-inferno/

jsmith99 · on July 2, 2021

That book isn't so much about R weirdness. It's more about teaching data scientists to consider the implications of practices like copying a huge table in memory on every loop iteration.

hugh-avherald · on July 2, 2021

Idiosyncrasies are not something unique to R.

One could express the same surprise at an empty list being considered false in some contexts.
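For anyone who hasn't run into it, a minimal sketch of that behavior, assuming the language meant here is Python (the comment doesn't name one):

```python
# Assumption: the "empty list is false" remark refers to Python's truthiness rules.
# An empty list is falsy in a boolean context; a non-empty one is truthy,
# even if its only element is itself falsy.
def describe(items):
    """Return 'empty' or 'non-empty' based on truthiness alone."""
    if not items:
        return "empty"
    return "non-empty"

print(describe([]))
print(describe([0]))
```

Note that `[]` is not equal to `False`; it only behaves like it inside `if`, `while`, and `bool()`.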

f6v · on July 2, 2021

R absolutely dominates some of the life sciences. For example, most of the state-of-the-art bioinformatics tools are in R.

professionalguy · on Feb 18, 2020

>Ideally grad students would stop pouring in, free labor would dry up at universities, and they’ll have to raise grad student salaries to acceptable wages again. Seems unlikely given how badly people want to do anthropology PhDs, and that there’ll always be people who can afford to take a poorly paid position like that because their partner or parents are paying the bills.

I like how you phrased that.

One potential solution is some feedback loop telling students how many PhDs in anthropology we really need. We're currently producing way more PhDs in non-STEM disciplines than there are post doctoral academic positions (I mean that broadly; e.g. post-docs, tenure track, full time lectureships).

Instead of giving out 5 PhD spots, maybe a program can give out 1 PhD spot and pay that student 3x more. They'd save money and could focus on creating one really good professor, instead of 5 struggling lecturers.

paconbork · on Feb 18, 2020

The problem is that having a large body of grad students increases your research output, allowing you to get more grants and raise the prestige of your institution at a relatively low cost. Universities are incentivized to admit more grad students if it leads to these improved outcomes.

GuiA · on Feb 18, 2020

And more students running your labs/substituting you for teaching/etc. means more time to do non-teaching stuff, which is a net positive for most professors.

How sad was I to realize that I was the only one in my PhD program who had chosen that route because I was as passionate about teaching as I was about research!

Feynman:

If you're teaching a class, you can think about the elementary things that you know very well. These things are kind of fun and delightful. It doesn't do any harm to think them over again. Is there a better way to present them? Are there any new problems associated with them? Are there any new thoughts you can make about them? The elementary things are easy to think about; if you can't think of a new thought, no harm done; what you thought about it before is good enough for the class. If you do think of something new, you're rather pleased that you have a new way of looking at it.

The questions of the students are often the source of new research. They often ask profound questions that I've thought about at times and then given up on, so to speak, for a while. It wouldn't do me any harm to think about them again and see if I can go any further now. The students may not be able to see the thing I want to answer, or the subtleties I want to think about, but they remind me of a problem by asking questions in the neighborhood of that problem. It's not so easy to remind yourself of these things.

http://www.math.utah.edu/~yplee/teaching/feynman.html

Well in the end I dropped out (:

repsilat · on Feb 19, 2020

It's not just selfish. If the school admitted fewer anthropology grad students, the marginal rejected candidate would say "Just let me in, I don't need to be guaranteed a postdoctoral position, and I'll work for a pittance -- I just really want to do my PhD." How is not admitting them kinder when they're beating down the doors to be let in and "exploited"?

majormajor · on Feb 19, 2020

From knowing a bunch of STEM PhDs over the past decade, we're producing way more of those than "needed" too. They just have cushier basically-unrelated fallback options.

nostrebored · on Feb 19, 2020

> One potential solution is some feedback loop telling students how many PhDs in anthropology we really need.

Like... a cost associated with the program... and long term potential earnings?

voxl · on Feb 19, 2020

It's so strange to me that the solution people propose here is to limit the free choice of students who want to further their education. A PhD does not have to be a professorship job training program any more than a bachelor's should be job training. A more educated workforce is a good thing in its own right.

The economics are already there for increased grad students, yet we want to artificially restrict it to prevent what? Sounds more like enforced elitism.

sharkmerry · on Feb 18, 2020

5 PhD spots is 5x the revenue. There's a reason they don't provide that data to potential students.

professionalguy · on Feb 14, 2020

Can someone ELI5?

professionalguy · on Jan 23, 2020

In data engineering, I think your response to the 'entity resolution' problem is a good Dunning-Kruger-style litmus test.

If you don't know, entity resolution is the process of matching unique rows in two or more databases. Are these the same movies? Are these the same person?

Novice DE: Oh easy, just merge on the name.

Intermediate DE: OH GOD NO. <michael_scott_no.jpg>

Expert DE: That's complicated, but I have a plan.

skellera · on Jan 23, 2020

Just curious, is there a standard way to start attacking that problem?

AstralStorm · on Jan 24, 2020

You always need some sort of data normalization scheme, and one that makes sense for the task you're running.

(This includes things such as Unicode normalization and looking at other fields to determine if it's the same thing.)

And you get to handle duplicates too.

That is just the start; the problem gets even more interesting in a real sharded scenario because eventual consistency is hard.
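A rough sketch of that starting point, not a standard recipe: normalize the name, then require a second field to agree before calling two rows the same entity. The record layout (`title`, `year`) and the helper names `normalize_name` and `match_records` are made up for illustration, and the sharding/consistency part is not covered at all.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Build a comparison key: NFKC-normalize, case-fold, collapse whitespace."""
    text = unicodedata.normalize("NFKC", name)  # folds ligatures, fullwidth forms, etc.
    text = text.casefold()                      # aggressive, locale-independent lowercasing
    return " ".join(text.split())               # trim and collapse runs of whitespace


def match_records(left, right):
    """Pair records whose normalized title AND year both agree.

    Hypothetical record shape: {"title": str, "year": int}.
    Duplicates on either side yield one pair per combination.
    """
    index = {}
    for rec in right:
        key = (normalize_name(rec["title"]), rec["year"])
        index.setdefault(key, []).append(rec)

    matches = []
    for rec in left:
        key = (normalize_name(rec["title"]), rec["year"])
        for candidate in index.get(key, []):
            matches.append((rec, candidate))
    return matches


if __name__ == "__main__":
    a = [{"title": "  The  Matrix", "year": 1999}, {"title": "Solaris", "year": 1972}]
    b = [{"title": "the matrix", "year": 1999}, {"title": "Solaris", "year": 2002}]
    for l, r in match_records(a, b):
        print(l["title"].strip(), "<->", r["title"])
```

The two "Solaris" rows don't match because the year differs, which is the "looking at other fields" part; merging on the name alone would have merged them.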

professionalguy · on Jan 3, 2020

Or R1 research universities

professionalguy · on Jan 3, 2020

that's exactly what I was thinking - plus what startup story is complete without some VC intervention?