Ask HN: Data Science portfolio projects ideas
76 points by koots on June 3, 2019 | 27 comments

I'm looking for a couple of medium-size project ideas that are not just following tutorials.

My two cents as someone who interviews tons of data scientists is that most portfolio projects are way too easy, and amount to getting generally clean data, then just calling some API from sklearn or tensorflow.

I'd like to see either more non-trivial software/coding skills in getting the data and setting up a good data infrastructure, or more depth on an innovative science solution.

This is exactly why I'm asking. So, since you are apparently senior in the field, do you have any concrete recommendations?

Sure. Let me start by saying that the data science interview/competency game isn't common knowledge. That is to say, even within our own company, where we are supposed to have standard guidelines, different organizations have different perspectives. So I'm sure other data scientists may disagree with me.

I have some trouble just giving you some full/rich idea, since there is a whole world of possibility. However, I can share some heuristics with you that you may find useful.

The first is, do you have any domain knowledge that would lend itself to a data science project? This would be one step in differentiating your idea, and allowing it to build off of existing ideas you have, as opposed to an off-the-shelf classifier project from a data science project site. This could be anything from biomedical data, to sports data, to market data etc. This will let you highlight your ability to dive deep and apply data science tools to a specific problem. Even if I'm interviewing someone who worked with medical data before, their ability to do data research building off domain knowledge is a strong signal that they will be able to do it again in a new domain.

The second is: can you get a semi-novel dataset? Even if it's just writing a fully-fledged Python script to scrape some APIs or (maybe) web pages, I want something that shows that you hunted down data and wrangled it, as opposed to downloading data_science_project.csv.
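As a rough illustration of the "hunt down data from an API" step, here is a minimal sketch of walking a cursor-paginated endpoint. The `fetch_page` callable and the `{"items": ..., "next": ...}` response shape are assumptions made for the example, not any particular service's API.

```python
from typing import Callable, Iterator, Optional


def harvest(
    fetch_page: Callable[[Optional[str]], dict],
    max_pages: int = 1000,
) -> Iterator[dict]:
    """Yield every record from a cursor-paginated source.

    `fetch_page(cursor)` is a hypothetical callable that returns a dict
    shaped like {"items": [...], "next": "<cursor>" or None}.
    """
    cursor = None
    # max_pages is a safety cap so a misbehaving API cannot loop forever;
    # the generator simply stops once the cap is reached.
    for _ in range(max_pages):
        page = fetch_page(cursor)
        yield from page.get("items", [])
        cursor = page.get("next")
        if not cursor:
            return
```

In a real script `fetch_page` would wrap an HTTP request; keeping it as a parameter means the pagination logic can be tested without touching the network.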

Once you get your data, try to think of a properly engineered way to store it. A CSV on your laptop isn't always bad, but familiarity with AWS/Azure APIs and storing your data in the cloud in a 'nicer' format (e.g. Parquet), or if necessary in a database, is a plus.

In your code, can you have a lightweight API to retrieve your data? Again, I'd be looking for something that tells me you can get, store, and retrieve data in line with best practices, so if you're hired and there is messiness/challenges with data, you can manage it yourself rather than needing an engineer to do all the work for you, and your job only starting when you have a csv on your local machine.
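To make the "store it properly, then retrieve it through a lightweight API" idea concrete, here is a small sketch. It swaps in a local SQLite database (Python's standard `sqlite3` module) for the cloud storage mentioned above, purely so the example is self-contained; the `ReadingStore` class, the `readings` table, and its columns are made-up names for illustration.

```python
import sqlite3
from typing import Iterable, List, Optional, Tuple


class ReadingStore:
    """Tiny wrapper so analysis code never touches SQL or file paths directly."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS readings ("
            " source TEXT NOT NULL,"
            " day TEXT NOT NULL,"  # ISO date string, e.g. 2019-06-03
            " value REAL NOT NULL,"
            " PRIMARY KEY (source, day))"
        )

    def add(self, rows: Iterable[Tuple[str, str, float]]) -> None:
        # INSERT OR REPLACE makes re-running the loader idempotent.
        with self.conn:
            self.conn.executemany(
                "INSERT OR REPLACE INTO readings VALUES (?, ?, ?)", rows
            )

    def get(self, source: str, start: Optional[str] = None) -> List[Tuple[str, float]]:
        """Return (day, value) pairs for one source, oldest first."""
        sql = "SELECT day, value FROM readings WHERE source = ?"
        params = [source]
        if start is not None:
            # ISO dates sort correctly as plain strings.
            sql += " AND day >= ?"
            params.append(start)
        return self.conn.execute(sql + " ORDER BY day", params).fetchall()
```

The point of the wrapper is that the notebook only ever calls `store.get(...)`, so the backend could later be replaced by a cloud bucket or a hosted database without changing the analysis code.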

Once you have all this, can you thoughtfully try out some different methodologies, as well as some interesting exploratory data analysis? This part is harder to give concrete recommendations on, but I'd like to see something that considers the problem space and the data type, and chooses the right algorithm. Then for the algorithm you chose, I'd like you to have a medium-depth understanding of how it works under the hood. The bad case is you just get some data, throw it at xgboost or a nnet, and say "well I read the API docs and sorta know how they work."

(As a side note, try not to over-complicate the problem. Always do a simple model as well as the exciting model you want to try, because exciting models are usually hard to manage in production.)
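One way to act on the "always do a simple model as well" advice is to score every candidate against a trivial baseline with the same metric. The sketch below is only an illustration in plain Python: a predict-the-mean baseline, a one-variable least-squares line standing in for the "more exciting" model, and mean absolute error as an assumed metric.

```python
from statistics import mean
from typing import Callable, Sequence

Model = Callable[[float], float]


def fit_mean_baseline(xs: Sequence[float], ys: Sequence[float]) -> Model:
    """Baseline: ignore x and always predict the training mean of y."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Model:
    """Ordinary least squares for y = intercept + slope * x.

    Raises ZeroDivisionError if every x is identical.
    """
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def mae(model: Model, xs: Sequence[float], ys: Sequence[float]) -> float:
    """Mean absolute error of a fitted model on (xs, ys)."""
    return mean(abs(model(x) - y) for x, y in zip(xs, ys))
```

If the fancier model cannot clearly beat `fit_mean_baseline` on held-out data, that is worth knowing before investing in anything harder to run in production.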

Lastly, put it on your github, and really highlight it on your resume or in the interview. I often gloss over portfolio project bullet points on a resume, but I'll always check a github if it exists.

Even if the project is half-baked or not as exciting as you want, having concrete github code I can read is worth so so so much more than any coding question I could ever ask.

Finally, my recommendation is for a data scientist generalist type. I do know some data scientists who are extremely valuable, more valuable than I am, who can't do any of that stuff. Usually they just work in a Jupyter notebook using data handed to them. In their case it tends to be because they are so talented/trained in, say, deep learning, that they add the most value to the team when someone else does everything else for them while they tweak hyper-parameters.

Thank you very much. I really like your advice. I would like to build on my domain knowledge. I'm a PhD student in CS focusing on algorithms.

While I’m actually that guy who tweaks DL hyperparameters, this is excellent advice. These are the things I’d be looking for in a DS candidate.

My personal data science blog (https://minimaxir.com/) was designed over the last few years specifically as a portfolio piece to get a data science job by creating advanced data visualization/analysis projects w/ code, and I was eventually successful. (it has done well on HN when it pops up too)

Just a question: why do you use the third person when describing yourself? I've seen it in resumes and some formal settings, but never on a personal homepage. Also I find it a bit odd.

Consider making tutorials for publications that you're interested in but that are nontrivial to read through and understand. Building a tutorial will give you a deeper understanding of a problem, help you communicate that understanding, and benefit the larger community.

Thank you. Do you think this will be appreciated for a job interview? I think it will take lots of time and effort for that to happen.

Yes. Part of the job interview will be selling your knowledge through past work. Spending the time to tutorialize content will make it easy and stress free to talk about it to a general audience. Being able to communicate your point clearly is fairly essential IMO.

As for whether they'll look at it themselves, I'm much more skeptical. The same issue (whether anyone checks at the resume stage or at the first filter stage) is present with other portfolio pieces, though.

I’ve had fun and built interesting projects based on harvesting tweets. As some other comments suggest, data collection is an important and hard skill that most tutorial projects ignore. If you can show 0) you came up with an interesting question, 1) had the idea to get this data, 2) harvested the data successfully, 3) formatted and cleaned it, and 4) ran appropriate algorithms to look for the answer to your question, you have everything you need from a portfolio project (even if the data doesn’t support your original hypothesis!).

OpenStreetMap (GIS, OpenData, Humanitarian, Visualisation)

You can import and analyse the OpenStreetMap data, and create some nice QA reports for the community.

Arxiv: https://arxiv.org/search/?query=openstreetmap&searchtype=all...

I really find this interesting.

Perhaps give me a hand researching text input. I'm starting to gather a rather large collection of ideas to be implemented. It's fun work that really allows you to think outside the box. It spans all the way from UI design down to system calls into the OS via C, so there are a lot of areas to cover. Let me know by commenting if you are interested.

Please elaborate. What are you researching about text input?

This is the reasoning behind the project. In short, I'd say that with VR around the corner and touchscreens everywhere, but no practical way of doing real work on smartphones and tablets, the way we look at text input should be reconsidered. International input and accommodating disabled people are other angles. tbf-rnd.life/blog/2019/05/21/hello-world

I'm working on compiling my findings into a book. http://tbf-rnd.life/the-book-of-lesser-known-input-methods/

Along with this I am creating a framework that can be used to demonstrate the different models, benchmark them, and provide common OS support to them.

As such, the Dasher model, say, can be compared more or less scientifically to the plain old keyboard method or, why not, chorded input.

So there are plenty of interesting concepts to dig into here: from UI design in JavaScript to text analysis (a lot of word2vec and other fun stuff) and, why not, 3D interfaces.

I do this as a portfolio of sorts since it demonstrates such a wide range of knowledge.

Furthermore, since we are dealing with a keyboard, something that is _always_ in use, it's really important to create a well-polished, fast method, so it's not for the faint-hearted.

I have actually done some work with doc2vec. What is in this text data? What do you hope to extract?

I heard that a good Kaggle profile is a data science equivalent to a good GitHub profile.

I haven’t tried it myself, and it looks more like smaller projects, but someone might find it interesting.


What's your background? And what sort of job are you looking to get from it?

If you want to be an actual scientist then do something that's actually scientific: work out an experimental design, collect your own data, analyse it and draw conclusions from it.

For example, what’s the relationship between crime in San Francisco and Starbucks locations? How’s the relationship conditioned on the weather? Does the size of the parking lot adjacent to Starbucks meaningfully affect crime independent of location?
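To sketch what "conditioned on" could look like in code: compute the relationship separately within each condition instead of once over all the data. The helper names and the `(condition, x, y)` row layout below are assumptions for illustration, and Pearson correlation is just one possible measure; no real crime or weather data is implied.

```python
from collections import defaultdict
from math import sqrt
from typing import Dict, Hashable, Iterable, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples.

    Raises ZeroDivisionError if either sample has zero variance.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def correlation_by_condition(
    rows: Iterable[Tuple[Hashable, float, float]],
) -> Dict[Hashable, float]:
    """Group (condition, x, y) rows by condition and correlate x with y in each group."""
    groups = defaultdict(lambda: ([], []))
    for cond, x, y in rows:
        groups[cond][0].append(x)
        groups[cond][1].append(y)
    return {cond: pearson(xs, ys) for cond, (xs, ys) in groups.items()}
```

A relationship that flips sign or vanishes between groups is exactly the kind of finding a pooled correlation hides, and neither version says anything about causation on its own.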

I’m a little biased but there are too many script kiddies. “Scientists” that copy/paste scripts and “analyse” by calling APIs, and don’t know how anything works. Data science à la Kaggle.

I'm a PhD student in CS focusing on algorithms. I want to work as a software engineer/data scientist.

You want your portfolio to communicate that you are fluent in data, resourceful, and think broadly on what data is. More generally, you want to communicate that you are valuable to the employer.

What is valuable is often rare. Some skills are common or are just the baseline.

Peculiarities in people are less commoditized, and when these peculiarities intersect with the activity domain of an organization, they become valuable. When these peculiarities are deep enough and span a broad range, and the intersection of that range and the organization's interests is quite large, they become extremely valuable.

These peculiarities are often a result of lifestyle, interests, musing, and wandering. They are often acquired over the years in the person's free time and are not taught in class.

This reads like something new-agey like the saying that goes "Instead of trying to paint a perfect picture, become perfect and just paint"

Now for more practical and less "general" speak... I'll have to bring personal anecdotes which, by definition, are about my specific experience. The pronoun "I" will be used too often for a regular post as a consequence. This serves as an example of what I mean by the above.

The first project I was involved with when I joined my current company as an engineer was related to heart data. It was convenient that I had worked on heart data before, read a lot of medical papers on the question, worked on anomaly detection, and was familiar with PhysioNet data and format, but had also worked on local hospital data filled with chest-hair-sweat-and-motion noise and gone through the challenges it represented. I could give the team pointers to good resources on the question, knew health professionals and faculty I was still in contact with, and had personal friends who are medical doctors and surgeons I could get insights from (thinking broadly about "data" not just as in digital format and CSV, but network, friends, domain experts, insights gleaned socializing).

Another project the company did was telecom subscribers churn prediction. I was invited to a brainstorming with the team discussing data and interesting features. One of them is standard of living and financial situation. I insisted on getting USSD data from the telecom company in addition to CRM data and surveys. When I was asked what it would tell us, I asked colleagues how frequently they checked their phone balances as employees (with a source of revenue) vs. how often they did as students. They all got the point: as students, it wasn't obvious that you even had enough airtime to make a call or send a text, so you sent a USSD request (free of charge) to see how much airtime you had left (thinking about data from "human moves" perspective and not forgetting the experience of being a broke student for feature engineering). It helped the project that I had gone through some books on GSM and CDMA networks (out of curiosity) and was more fluent in the data the telco sent and their jargon. I could help the team with that, recommending reading sources curated over a long time, insights from personal acquaintances in different roles in the telecom domain (engineering, sales, marketing, etc.).

Another project the company did was a reservoir characterization project for oil and gas. It happened that I had interned for the biggest oil services company in that exact position and had read several books on reservoir characterization. I also had exposure to the hardware, the process, the different players and their incentives, and went to actual reservoir characterization jobs (it paid to know about oil-based muds, boreholes, deviation, cuttings, etc.). I could help by sharing context with the team, knowing what to look for, who to ask and what, where to get data, and what the domain name for a given thing was. I also had friends working in that domain in different geographic locations with different companies.

Another project I was in involved sound. My training was in EE so I had more training in signal processing than the team and also had courses on acoustics. I was able to help with pure signal processing and acoustics, resources to bring someone up to speed, explanations, etc. I had interest and knowledge in the source that was producing the sound. It helped in meetings with the client because the sound source was very peculiar. The client was impressed because they felt I knew more than an outsider should, given regulations and the nature of the source. I was able to handle it safely and use it very accurately to their surprise and to my employer's because I had never talked about it. I also had access to people with much more domain expertise than the client organization giving extremely valuable insights on real world condition and more interesting and frequent access to more diverse data sources. When we had to build custom hardware and mics, it helped that I was comfortable with a soldering iron, too.

When we did a project for a retail organization, it helped that I was already primed: I had gone through their site, read their page source, knew they were using the schema.org ontology, knew how their site was structured, had already parsed their sitemap and built a scraper for that site, and did all that before joining my employer. Plus I had the code.
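For readers who have never parsed a sitemap, a minimal sketch follows. It assumes the site publishes a standard sitemaps.org XML file and that its text has already been downloaded; `sitemap_urls` is a hypothetical helper name, not the code referred to above.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace used by the sitemaps.org 0.9 protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text: str) -> List[str]:
    """Return every <loc> URL found in a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]
```

The resulting list is a natural crawl frontier for a scraper, which is politer and more complete than guessing URLs by following links.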

Another project was in banking, where I also had some experience because in earlier years I had gotten interested in how banks work, wrote some code for parsing transaction data, and also had friends in different banks and financial institutions explaining things (again, data of another nature and from other sources).

Another project was related to data from Programmable Logic Controllers, and it helped that I had read a bunch on the question, tinkered with Siemens PLCs, etc. (it also helped with one of our new hires, a student working on a project relating to communication protocols for PLCs, who found out during the interview that there was someone in the company who was also familiar with the subject and could give pointers and add value to his work. It helped convince him to work here).

Other anecdotes involve visiting sites in Russian that were not translatable (images instead of text content), and being unfazed and able to sort of get around because I had tried learning Russian earlier. It wasn't much, but it saved time, and just the spirit of "whatever it takes" can be contagious. This was a startup, and just the boost in morale or anything that removes or tames obstacles helps.

Serendipity at its finest.

And last but not least, and at the risk of being tacky: being able to communicate with people in writing, face to face, and on the phone is enormously helpful. Having a certain "lifestyle", for lack of a better word, that kept that sword sharp helped a lot. Having been in sales as a college student didn't hurt either.

The underlying message is: I think you can build a portfolio based on your interests and I think it helps to cultivate your interests. I think it's nice to be able to work on a Kaggle dataset with clean data in CSV format and nicely labeled images, but it helps to think about data in more ways and keep in mind that it's important to get things done and help others get them done, in any way you can. Data is much more than CSV files and annotated images. The questions to ask are:

- How often do you think you get that kind of data (clean, ready, nicely formatted, with client being responsive and supporting you)?

- In which ways can you bring more value to your employer by helping get things done, often drawing on your previous experience, work, and code in a domain of interest?

- How can you act as a lever for other team members?

- How can you act as a bridge between stakeholders and do impedance matching to increase effectiveness of the whole system?

- How do you feel about "business" knowledge (basic econ, ops management, marketing, accounting, etc.)? It helps transduce features/bug fixes/refactoring into business terms stakeholders understand.

- How can you move obstacles out of the way, however small or boulder-sized they may be?

Some things I have found useful:

- Maintain a network of interesting and smart people in different domains (physicians and physicists, chemists, poets, painters, musicians, engineers, teachers, bankers)

- Reading a lot about a lot.

- Implementing stuff. Getting HTTP 429 and knowing what to do about it. Experimenting. Documenting.

- Sharing.

- Helping others be better at what they do, do it better and more profitably. Connecting people and wanting them to succeed.
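On the "getting HTTP 429 and knowing what to do about it" item: one common response is to wait and retry, honoring the server's `Retry-After` header when it gives a number of seconds and otherwise backing off exponentially. The sketch below assumes a hypothetical `send` callable that performs one request and returns `(status, headers, body)` with headers as a plain dict; it is an illustration, not a complete client.

```python
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str], bytes]


def get_with_backoff(
    send: Callable[[], Response],
    max_tries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call `send` until it stops returning HTTP 429 (Too Many Requests)."""
    for attempt in range(max_tries):
        status, headers, body = send()
        if status != 429:
            return status, headers, body
        if attempt == max_tries - 1:
            break  # no point sleeping after the final attempt
        retry_after = headers.get("Retry-After", "")
        # Retry-After may also be an HTTP date; this sketch only handles
        # the integer-seconds form and otherwise doubles the delay each try.
        if retry_after.isdigit():
            delay = float(retry_after)
        else:
            delay = base_delay * 2 ** attempt
        sleep(delay)
    raise RuntimeError(f"still rate-limited after {max_tries} attempts")
```

Passing `sleep` in as a parameter keeps the function testable without actually waiting; adding random jitter to the delay is a common refinement when many clients retry at once.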

Now, if I see that a candidate can hustle, I'd be very interested. I can count such candidates on one finger, and the kid was snatched faster than I could get to him (and was snatched by an acquaintance working at a top institution with a sorry-not-sorry).

+1 . This post was really useful and detailed. Thanks.

Thank you for your detailed advice.

Nice post; it sounds like the screenplay of Slumdog Millionaire.

If you are interested in NLP, entity detection/classification in news articles could be an interesting place to start.

Training data: Wikipedia

Thank you for your concrete suggestions.

I've got some semi-clean data that needs crunching, part of a soon to be GPL project, it's actually pretty plain but can give you stuff to blog about and post on your GitHub. My handle at gmail.
