IanOzsvald's comments | Hacker News

HotPy is Mark Shannon's research project, he's not making a 'new better Python for the masses', he's using it to test his assumptions about ways CPython could be improved. Ask him about it if you bump into him at a python conference!

reply


Thanks, it's clearer now. I misunderstood Pyston's motives and thought it was just making Python faster rather than writing it from scratch.

reply


The recent Py2 vs Py3 survey over Christmas suggests that approx. 32% of respondents write in Python 3 (increasing on last year), 68% write in Python 2 (decreasing on last year). Py3 usage is up approx. 12% on last year's survey. For personal projects Py2 and Py3 have roughly equal popularity: http://www.randalolson.com/2015/01/30/python-usage-survey-20...

At monthly PyDataLondon meetings I remind the audience to switch to Py3 (a few do each month) as Py2's sunset date is less than 5 years away now.

reply


Yes, but the "survey" has a terrible bias of people who actually care enough to go and respond to such survey. It also means overrepresentation of python-dev people and so on. There is a very heavy bias towards Python 3 in such a survey (as opposed to say pypi download stats)

reply


Hey fijal. Agreed that the survey has a self-selecting audience, I'd also argue that they're the more forward-thinking folk rather than jobbing background users.

Back in April 2013 (the last time I saw python.org download stats - where did they go?!) I wrote a blog post noting that fresh downloads of Windows Python 3.3 were greater than downloads of Windows Python 2.7, for 3 months running. Windows is useful as Python isn't bundled (unlike e.g. Linux and Mac). I presume this trend has continued but have no firm evidence either way: http://ianozsvald.com/2013/04/15/more-python-3-3-downloads-t...

What do the PyPI stats say?

reply


I use a Dell E6420; as a data scientist I kept breaking the default 8GB, so I upgraded to 16GB, which is almost always fine. I don't go above 8GB most days, but when I do, I need it (else I lose hours trying to partition data and think about ways to subselect). 16GB for data science seems a sensible minimum for the folk I know. I'm on Linux Mint 17 (Ubuntu 14.04) + Python.

reply


Don't forget that the higher-level functionality (e.g. the scikit-learn routines Radim uses) is typically a wrapper around underlying C/Fortran routines, and those routines are the real bottleneck. The relatively few lines of VM'd Python are 'slow' compared to e.g. C but aren't the bottleneck.
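To make that concrete, here's a tiny timing sketch of my own (not from scikit-learn; absolute numbers will vary by machine) showing that the time goes into the compiled routines, not the thin layer of Python calling them:

    import timeit

    # Both timings work on the same 1,000,000-element array.
    setup = "import numpy as np; a = np.random.rand(1000000)"

    # Summing squares with a pure-Python loop over the array's elements.
    py_time = timeit.timeit("sum(x * x for x in a)", setup=setup, number=10)

    # The same arithmetic pushed down into numpy's compiled routines.
    np_time = timeit.timeit("np.dot(a, a)", setup=setup, number=10)

    print("pure Python: %.3fs, numpy (compiled code underneath): %.3fs" % (py_time, np_time))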

The win with Python (and other dynamic languages) is that you can experiment quickly with ideas when you're formulating a solution, that's a big part of exploratory data science.

If you're curious about high-speed work in Python - Radim did a blog series on how he sped up word2vec to be faster than Google's original C code: http://radimrehurek.com/2013/09/deep-learning-with-word2vec-...

I'll also note [self promo!] that I wrote a book on High Performance Python, if that's your cup of tea (and Radim wrote a section in it): http://shop.oreilly.com/product/0636920028963.do

-----


(tutorial author here) Good answer, and I can only recommend Ian's book!

I cut the marketing speak down to minimum in my articles and tutorials, but if you're interested in cutting edge machine learning & no-nonsense data mining, get in touch! I run a world class consulting company, http://radimrehurek.com.

-----


> The win with Python (and other dynamic languages) is that you can experiment quickly with ideas when you're formulating a solution, that's a big part of exploratory data science.

And in my experience, very hard to reproduce after a couple of years. With enough discipline, it's obviously possible to make well-structured Python programs that will last. But in practice that rarely happens with scientific software written in Python. Usually, there are many external dependencies, it's fragile (no static type checking), and platform-dependent (usually OS X or Linux). To add to the mess, most scientists like to hardcode paths to the input data, etc.

Although I am not a fan of Java, I usually don't encounter the same problems with older scientific Java software. If it's Mavenized you are usually ready to go after a 'mvn compile'; otherwise you just dump the project structure in an IDE and it usually works.

(The plague with scientific software in Java is that it is often not thread-safe.)

Also, I think quick experimentation is not limited to Python: statically typed languages with a REPL (Haskell, OCaml, Scala) can also provide that. And since Go was mentioned: as compilation time in Go is usually near-zero, it's much the same there.

-----


> And in my experience, very hard to reproduce after a couple of years.

Well, let's be honest with ourselves... this isn't limited to Python. Scientific code that isn't a mess is almost nonexistent. For a lot of scientists, writing code is totally secondary and many simply aren't skilled programmers (nor should we necessarily expect them to be).

It is however deeper than that. As a graduate student, I was involved in a government initiative to write a high quality large scale code package. This was (still is, the program just got extended) a well funded and well organized effort with hundreds of people, including literally dozens of people who can legitimately claim to be the best in the world at their specialties. This included some genuinely amazing computer scientists and software engineers who enforced well planned coding practices.

And yet, the code is still far from ideal. A big part of this is its scale - millions of lines of very technical numerics code and libraries all working together. Most of what I consider to be the toughest work was on integrating various disparate pieces and unifying them under one common input structure.

Point being, even with effectively unlimited resources, rigorous development standards, and statically typed languages (primarily C++11), there are still tons of issues. A lot of it comes from incorporating older code, which is inescapable in any non-trivial scientific codebase.

-----


> I'll also note [self promo!] that I wrote a book on High Performance Python

I've really enjoyed this book so far, so thanks!

-----


Glad you enjoyed it :-) If you have a moment, leaving a review (e.g. on Amazon) would be most appreciated (there's a dearth of reviews as it is a bit of a niche subject!)

-----


Nice! Just bought your book! :)

-----


Also you've given some amazing talks at various PyCons!

-----


Much obliged (assuming that's directed at me!) - Radim's started with some rather nice talks too :-)

-----


Consider Wikipedia->DBpedia->Wikidata->Wikipedia. First we had semi-structured, human-readable information expressed using MediaWiki's markup, which was hard to parse automatically. Next the DBpedia project (and YAGO) created tools to parse the MediaWiki markup and extract facts as triples (e.g. "this thing" "is-a" "company"); along the way they encountered many alternative ways of expressing the same information (e.g. date variants, weights and measures).
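As a small illustration of what those extracted triples look like, here's a sketch of my own (assuming the public endpoint at http://dbpedia.org/sparql is reachable) that asks DBpedia for a few things it has extracted as companies:

    import requests

    # Ask DBpedia's public SPARQL endpoint for a handful of resources typed
    # as companies, plus their English labels.
    query = """
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?company ?name WHERE {
      ?company a dbo:Company ;
               rdfs:label ?name .
      FILTER (lang(?name) = "en")
    } LIMIT 5
    """

    resp = requests.get(
        "http://dbpedia.org/sparql",
        params={"query": query, "format": "application/sparql-results+json"},
    )
    for row in resp.json()["results"]["bindings"]:
        print(row["company"]["value"], "-", row["name"]["value"])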

Now the Wikidata project (2012) is normalising the data in Wikipedia so that projects like DBpedia have an easier time with the raw information (no need to write alternative parsers for dates, weights, measures and simple facts!). As a result we've gone from human-readable information to machine-readable semantic-web-like information which is accessible via Linked Open Data.

Maybe the driver for semantic web data is humans trying to programmatically consume human-readable information, rather than the other way around?

-----


Working on the UniProt RDF, I find that every time we make it easier for semantic tools to work with our data, it actually improves usability for human users as well.

The RDF part of the semweb idea encourages us to be extremely explicit about what we mean with our data. This helps our end users because it removes a lot of guesswork. What was obvious to us as maintainers is not obvious at all to the biologists who need to do stuff with our data. E.g. http://www.uniprot.org/changes/cofactor (going to be live soon) is a small change from textual descriptions of which chemicals are cofactors for enzymatic activity to using the ChEBI ontology. This allows us to do better rendering (UI) and better searching. It also makes clear the difference between "the cofactor is any Magnesium" and "the cofactor is only Magnesium(2+)".

In the life sciences and pharma, semweb has a decent amount of uptake, for the very simple reason that this branch deals with a lot of varied information and often mixes private and public data. RDF makes it cheaper for organisations to deal with this.

SPARQL, the query language, has a key feature that no other technology has in the same way: federated queries. If I am in a small lab I can't afford to host a data warehouse of UniProt; it would cost me 20,000-30,000 euro just to run and maintain the hardware. As a small lab I can use beta.sparql.uniprot.org for free and still combine it with my own and other public data for advanced queries. Sure, UniProt has a good REST interface, but it is limited in what you can do with it in ways that SPARQL never will be.
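For a flavour of what that looks like from a small lab's point of view, here's a minimal sketch of my own (the /sparql path and the up:mnemonic property are my assumptions from the UniProt core ontology, so check them against the endpoint's docs):

    import requests

    # A plain query sent straight to the public UniProt endpoint. In a truly
    # federated set-up you would query your own local store and wrap this
    # pattern in a SERVICE <http://beta.sparql.uniprot.org/sparql> { ... }
    # block so the public data is pulled in alongside your private data.
    query = """
    PREFIX up: <http://purl.uniprot.org/core/>
    SELECT ?protein ?mnemonic WHERE {
      ?protein a up:Protein ;
               up:mnemonic ?mnemonic .
    } LIMIT 5
    """

    resp = requests.get(
        "http://beta.sparql.uniprot.org/sparql",
        params={"query": query},
        headers={"Accept": "application/sparql-results+json"},
    )
    for row in resp.json()["results"]["bindings"]:
        print(row["mnemonic"]["value"], row["protein"]["value"])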

SPARQL has only been interesting as a query language since last year. Schema.org has only been interesting since last year. JSON-LD has only been interesting since last year. Semweb is finally growing into its possibilities and making real what was promised 17 years ago.

Of course, even in the life science domain many developers don't know what one can do with semweb tech, and semweb marketing is nowhere near as effective as e.g. MongoDB's or even Neo4j's. So uptake is still slow, but it is accelerating!

-----


Here's something I think should be easy: I want to count the events (by month) on all of Wikipedia's 'year' pages.[1]

First problem: multiple levels of indent, so there needs to be logic for that. Second problem: some pages have weird stuff in the events section, like "world population".[2] Third problem: some pages have "date unknown" events.[2] Fourth problem: all the other problems I'd encounter if I looked at more than two of these pages.

So it's a day's worth of work to get something rough and 2-3 days to get something solid. To answer one question of this kind.
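For what it's worth, this is the kind of rough first pass I mean (my own sketch, and it deliberately handles only the easy cases - the indentation, "date unknown" and "world population" problems above still need real handling):

    import re
    from collections import Counter

    import requests

    MONTHS = ("January", "February", "March", "April", "May", "June", "July",
              "August", "September", "October", "November", "December")
    MONTH_LINK = re.compile(r"\[\[(%s)\b" % "|".join(MONTHS))

    def event_counts(year):
        # Fetch the raw wikitext for the year page (e.g. en.wikipedia.org/wiki/1999).
        wikitext = requests.get(
            "https://en.wikipedia.org/w/index.php",
            params={"title": str(year), "action": "raw"},
        ).text
        # The == Events == section runs until the next level-2 heading.
        section = re.search(r"==\s*Events\s*==(.*?)\n==[^=]", wikitext, re.S)
        counts = Counter()
        if not section:
            return counts
        for line in section.group(1).splitlines():
            # Treat each bullet as one event, keyed by the month in its date link.
            if line.lstrip().startswith("*"):
                match = MONTH_LINK.search(line)
                if match:
                    counts[match.group(1)] += 1
        return counts

    print(event_counts(1999))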

Does DBpedia help me?

[1] e.g. https://en.wikipedia.org/wiki/1999 [2] https://en.wikipedia.org/wiki/1955

-----


I've just finished writing "High Performance Python" for O'Reilly (due August); we have a chapter on Lessons from the Field in which one chap talks about his successful many-machine roll-out of a complex production system using PyPy for a 2x overall speed gain. We also cover Numba, Cython, profiling, numpy etc. - all the topics you'd expect.
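For a flavour of the Numba material, here's a toy of my own along those lines (not an excerpt from the book): decorate a plain numeric loop and Numba JIT-compiles it to machine code.

    import numpy as np
    from numba import njit

    @njit
    def sum_sq_diff(a, b):
        # A plain Python loop; Numba compiles it on the first call.
        total = 0.0
        for i in range(a.shape[0]):
            total += (a[i] - b[i]) ** 2
        return total

    a = np.random.rand(10 ** 6)
    b = np.random.rand(10 ** 6)
    print(sum_sq_diff(a, b))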

-----


Martin gave this paper as a talk at our PyData London conference this weekend (thanks Martin!); videos will be linked once we have them. He shares hard-won lessons and good advice. Here's my write-up: http://ianozsvald.com/2014/02/24/pydatalondon-2014/

-----


HackerNewsLondon is on tonight (in about an hour): http://www.meetup.com/HNLondon/ . If you turn up and blag your way in (you're meant to buy a ticket in advance, so offer cash on the door to pay for your pizza!) you'll see companies presenting, often talking about jobs. Just network like crazy; most people there will know people who are hiring. Everyone is friendly and the organisers try to help startups. Go speak to co-organiser Steve (he's an ex-recruiter); he'll know who is looking.

http://www.techcityjobs.co.uk/ has a lot of London-focused tech jobs, lots of Ruby etc. Ruby is definitely in demand in London. Avoid being a waiter if you can; your skills will rust.

http://siliconmilkroundabout.com/ took place last weekend; over 100 startups were there pitching jobs. Go through the list of companies and cherry-pick, then drop them a line? The CTOs I spoke with noted the lack of Rails folk. Most of the companies had 2-8 job openings each.

Bonne chance, i.

-----


Also, http://hackerjobs.co.uk/ run by Peroni (aka Steve Buckley).

http://www.jobserve.com/ is also decent...

-----


I'm working on an O'Reilly book using IPython Notebook, basing my workflow on Olivier Grisel's script (with a couple of tiny fixes): https://github.com/ianozsvald/ipynbhelper . It extracts the Notebook's 'markdown' blocks (which contain asciidoc, so they obviously won't render in the browser) and code blocks; these get exported as asciidoc for O'Reilly.
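The extraction itself is simple in principle; here's a minimal sketch of the idea (not Olivier's actual script, and the notebook filename is made up):

    import nbformat

    nb = nbformat.read("chapter.ipynb", as_version=4)  # hypothetical notebook

    for cell in nb.cells:
        if cell.cell_type == "markdown":
            # These cells hold the asciidoc source, passed through untouched.
            print(cell.source)
        elif cell.cell_type == "code":
            # Code cells become asciidoc source listings.
            print("[source,python]\n----\n%s\n----" % cell.source)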

It is clunky but I'm hoping we'll get better control as nbconvert evolves, so we're experimenting with this approach.

Most of the code examples are not 'live'; they're pasted in along with analysis results (the book is about high performance and parallel computing: http://shop.oreilly.com/product/0636920028963.do ), as lots of the examples are best run from a fresh VM.

-----


Not entirely true - my Dell E6420 is certified and yet had complications (poor Suspend support; the touchpad had only basic support [now fixed, 2 years later]). The NVIDIA GPU was supported, as were sound/internet etc. I used the official Dell Ubuntu drivers and they didn't solve the original problems (much hacking did...).

-----
