I'm looking for a couple of medium-size project ideas that are not just following tutorials.
I'd like to see either more non-trivial software/coding skills in getting the data and setting up a good data infrastructure, or more depth on an innovative science solution.
I have some trouble just giving you a full/rich idea, since there is a whole world of possibilities. However, I can share some heuristics with you that you may find useful.
The first is, do you have any domain knowledge that would lend itself to a data science project? This would be one step in differentiating your idea, and allowing it to build off of existing ideas you have, as opposed to an off-the-shelf classifier project from a data science project site. This could be anything from biomedical data, to sports data, to market data etc. This will let you highlight your ability to dive deep and apply data science tools to a specific problem. Even if I'm interviewing someone who worked with medical data before, their ability to do data research building off domain knowledge is a strong signal that they will be able to do it again in a new domain.
The second is, can you get a semi-novel dataset? Even if it's just writing a fully fledged Python script to scrape some APIs or (maybe) web pages, you want something that shows that you hunted down data and wrangled it, as opposed to downloading data_science_project.csv.
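To make the "hunt down your own data" point concrete, here is a minimal sketch of a script that walks a paginated JSON API. The page shape (a `results` list plus a `next` URL) is an assumption for illustration, not any particular service's format, and the function names are made up for this example.

```python
import json
import urllib.request
from typing import Callable, Iterator, Optional


def http_get_json(url: str, timeout: float = 10.0) -> dict:
    """Fetch one URL and decode its JSON body."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "portfolio-scraper/0.1"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def iter_records(
    start_url: str,
    get_json: Callable[[str], dict] = http_get_json,
    max_pages: int = 100,
) -> Iterator[dict]:
    """Yield every record from a paginated endpoint.

    Assumption: each page looks like
    {"results": [...], "next": "<url of next page, or null>"}.
    """
    url: Optional[str] = start_url
    pages = 0
    # Stop when the API reports no next page or the page cap is reached.
    while url and pages < max_pages:
        page = get_json(url)
        yield from page.get("results", [])
        url = page.get("next")
        pages += 1
```

Passing the fetch function in as `get_json` keeps the pagination logic testable without touching the network, which is also a small signal that the code was written with some care.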
Once you get your data, try to think of a properly engineered way to store it. A csv on your laptop isn't always bad, but familiarity with AWS/Azure APIs and storing your data on the cloud in a 'nicer' format (e.g. Parquet), or if necessary in a database, is a plus.
In your code, can you have a lightweight API to retrieve your data? Again, I'd be looking for something that tells me you can get, store, and retrieve data in line with best practices, so if you're hired and there is messiness/challenges with data, you can manage it yourself rather than needing an engineer to do all the work for you, and your job only starting when you have a csv on your local machine.
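As a rough sketch of what "store it properly and retrieve it through a small API" could look like, the example below uses SQLite from the standard library as a stand-in for a cloud store or Parquet files, so that it runs anywhere. The table layout and the `MeasurementStore` name are assumptions made up for illustration.

```python
import sqlite3
from typing import Iterable, List, Optional, Tuple

Row = Tuple[str, str, float]  # (ISO date, series name, value)


class MeasurementStore:
    """Tiny data-access layer: callers never write SQL themselves."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS measurements ("
            " day TEXT NOT NULL, series TEXT NOT NULL, value REAL NOT NULL,"
            " PRIMARY KEY (day, series))"
        )

    def save(self, rows: Iterable[Row]) -> None:
        """Insert rows; a repeated (day, series) pair overwrites the old value."""
        with self._conn:  # commits on success, rolls back on error
            self._conn.executemany(
                "INSERT OR REPLACE INTO measurements (day, series, value)"
                " VALUES (?, ?, ?)",
                rows,
            )

    def load(
        self,
        series: str,
        start: Optional[str] = None,
        end: Optional[str] = None,
    ) -> List[Tuple[str, float]]:
        """Return (day, value) pairs for one series, ordered by day."""
        query = "SELECT day, value FROM measurements WHERE series = ?"
        params = [series]
        # ISO dates compare correctly as plain strings.
        if start is not None:
            query += " AND day >= ?"
            params.append(start)
        if end is not None:
            query += " AND day <= ?"
            params.append(end)
        query += " ORDER BY day"
        return self._conn.execute(query, params).fetchall()
```

The analysis code then only calls `store.load("temp", start="2024-01-01")`, so swapping SQLite for something hosted later would not touch the notebooks.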
Once you have all this, can you thoughtfully try out some different methodologies, as well as some interesting exploratory data analysis? This part is harder to give concrete recommendations on, but I'd like to see something that considers the problem space and the data type, and chooses the right algorithm. Then for the algorithm you chose, I'd like you to have a medium-depth understanding of how it works under the hood. The bad case is you just get some data, throw it at xgboost or a nnet, and say "well I read the API docs and sorta know how they work."
(As a side note, try not to over-complicate the problem. Always do a simple model as well as the exciting model you want to try, because exciting models usually are hard to manage in production.)
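One way to keep yourself honest about the simple-model advice is to always score a trivial baseline next to the exciting model. The sketch below uses a predict-the-training-mean baseline and mean absolute error; both choices, and all the function names, are assumptions for illustration rather than a recommendation for any specific problem.

```python
from statistics import mean
from typing import Callable, Sequence


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average absolute difference between targets and predictions."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def mean_baseline(train_targets: Sequence[float]) -> Callable[[object], float]:
    """Return a 'model' that ignores its input and predicts the training mean."""
    average = mean(train_targets)
    return lambda _features: average


def beats_baseline(
    model: Callable[[object], float],
    baseline: Callable[[object], float],
    test_features: Sequence[object],
    test_targets: Sequence[float],
) -> bool:
    """True only if the model's error is strictly lower than the baseline's."""
    model_error = mean_absolute_error(
        test_targets, [model(x) for x in test_features]
    )
    baseline_error = mean_absolute_error(
        test_targets, [baseline(x) for x in test_features]
    )
    return model_error < baseline_error
```

If the fancy model cannot clear this bar on held-out data, that is worth writing up in the project too.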
Lastly, put it on your github, and really highlight it on your resume or in the interview. I often gloss over portfolio project bullet points on a resume, but I'll always check a github if it exists.
Even if the project is half-baked or not as exciting as you want, having concrete github code I can read is worth so so so much more than any coding question I could ever ask.
Finally, my recommendation is for a data scientist generalist type. I do know some data scientists who are extremely valuable, more valuable than I am, who can't do any of that stuff. Usually they just work in a jupyter notebook using data handed to them. In their case it tends to be because they are so talented/trained in, say, deep learning, that they add the most value to the team when someone else does everything else for them while they tweak hyper-parameters.
As for whether they'll look at it themselves, I'm much more skeptical. The same issue is present (checking at the resume stage or at the first filter stage) with other portfolio pieces though.
You can import and analyse the OpenStreetMap data, and create some nice QA reports for the community.
I'm working on compiling my findings into a book.
Along with this I am creating a framework that can be used to
- provide common os support
to the different models.
As such, say, the Dasher model can be compared more or less scientifically to, say, the plain old keyboard method or, why not, chorded input.
I do this as a portfolio of sorts since it demonstrates such a wide range of knowledge.
Furthermore, as we are dealing with a keyboard, something that is _always_ in use, it's really important to create a well-polished, fast method, so it's not for the faint-hearted.
I haven’t tried it myself, and it looks more like smaller projects, but someone might find it interesting.
If you want to be an actual scientist then do something that's actually scientific: elaborate an experiment design, collect your own data, analyse it and draw conclusions from it.
For example, what’s the relationship between crime in San Francisco and Starbucks locations? How’s the relationship conditioned on the weather? Does the size of the parking lot adjacent to Starbucks meaningfully affect crime independent of location?
I’m a little biased but there are too many script kiddies. “Scientists” that copy/paste scripts and “analyse” by calling APIs, and don’t know how anything works. Data science à la Kaggle.
Training Data : Wikipedia
What is valuable is often rare. Some skills are common or are just the baseline.
Peculiarities in people are less commoditized, and when these peculiarities intersect with the activity domain of an organization, they become valuable. When these peculiarities are deep enough and span a broad range, and the intersection of that range and the organization's interests is quite large, they become extremely valuable.
These peculiarities are often a result of lifestyle, interests, musing, and wandering. They are often acquired over the years in the person's free time and are not taught in class.
This reads like something new-agey, like the saying that goes "Instead of trying to paint a perfect picture, become perfect and just paint."
Now for more practical and less "general" speak... I'll have to bring personal anecdotes which, by definition, are about my specific experience. The pronoun "I" will be used too often for a regular post as a consequence. This serves as an example of what I mean by the above.
The first project I was involved with when I joined my current company as an engineer was related to heart data. It was convenient that I had worked on heart data before, read a lot of medical papers on the question, worked on anomaly detection, and was familiar with PhysioNet data and format, but had also worked on local hospital data filled with chest-hair-sweat-and-motion noise and gone through the challenges it represented. I could give the team pointers to good resources on the question, knew health professionals and faculty I was still in contact with, and had personal friends who are medical doctors and surgeons I could get insights from (thinking broadly about "data" not just as in digital format and CSV, but network, friends, domain experts, insights gleaned socializing).
Another project the company did was telecom subscriber churn prediction. I was invited to a brainstorming session with the team discussing data and interesting features. One of them was standard of living and financial situation. I insisted on getting USSD data from the telecom company in addition to CRM data and surveys. When I was asked what it would tell us, I asked colleagues how frequently they checked their phone balances as employees (with a source of revenue) vs. how often they did as students. They all got the point: as students, it wasn't obvious that you even had enough airtime to make a call or send a text, so you sent a USSD request (free of charge) to see how much airtime you had left (thinking about data from a "human moves" perspective and not forgetting the experience of being a broke student for feature engineering). It helped the project that I had gone through some books on GSM and CDMA networks (out of curiosity) and was more fluent in the data the telco sent and their jargon. I could help the team with that by recommending reading sources curated over a long time and sharing insights from personal acquaintances in different roles in the telecom domain (engineering, sales, marketing, etc.).
Another project the company did was a reservoir characterization project for oil and gas. It happened that I had interned for the biggest oil services company in that exact position and read several books on reservoir characterization. I also had exposure to the hardware, the process, the different players and their incentives, and went to actual reservoir characterization jobs (it paid to know about oil-based muds, boreholes, deviation, cuttings, etc.). It helped to share context with the team, knowing what to look for, who to ask and what, where to get data, and what the domain term for something was. I also had friends working in that domain in different geographic locations with different companies.
Another project I was in involved sound. My training was in EE so I had more training in signal processing than the team and also had courses on acoustics. I was able to help with pure signal processing and acoustics, resources to bring someone up to speed, explanations, etc. I had interest and knowledge in the source that was producing the sound. It helped in meetings with the client because the sound source was very peculiar. The client was impressed because they felt I knew more than an outsider should, given regulations and the nature of the source. I was able to handle it safely and use it very accurately, to their surprise and to my employer's, because I had never talked about it. I also had access to people with much more domain expertise than the client organization, who gave extremely valuable insights on real-world conditions and more interesting and frequent access to more diverse data sources. When we had to build custom hardware and mics, it helped that I was comfortable with a soldering iron, too.
When we did a project for a retail organization, it helped that I already was primed because I had gone through their site, read their pages' source, knew they were using the schema.org ontology, knew how their site was structured, had already parsed their sitemap, and had built a scraper for that site, all before joining my employer. Plus I had the code.
Another project was in banking, where I also had some experience because I had gotten interested in earlier years in how banks work, had written some code for parsing transaction data, and had friends in different banks and financial institutions explaining things (again, data of another nature and from other sources).
Another project was related to data from Programmable Logic Controllers, and it helped that I had read a bunch on the question, tinkered with Siemens PLCs, etc. (it also helped when one of our new hires, then a student working on a project on communication protocols for PLCs, found out during the interview that there was someone in the company who was also familiar with the topic, could give pointers, and could add value to his work. It helped convince him to work here).
Other anecdotes include visiting sites in Russian that were not translatable (images instead of text content), and being unfazed and able to sort of get around because I had tried learning Russian earlier. It wasn't much, but it saved time, and just the spirit of "whatever it takes" can be contagious. This was a startup, and just the boost in morale or anything that removes or tames obstacles helps.
Serendipity at its finest.
And last but not least, and at the risk of being tacky: being able to communicate with people in writing, face to face, and on the phone is enormously helpful. Having a certain "lifestyle", for lack of a better word, that kept that sword sharp, helped a lot. Having been in sales as a college student didn't hurt either.
The underlying message is: I think you can build a portfolio based on your interests and I think it helps to cultivate your interests. I think it's nice to be able to work on a Kaggle dataset with clean data in CSV format and nicely labeled images, but it helps to think about data in more ways and keep in mind that it's important to get things done and help others get them done, in any way you can. Data is much more than CSV files and annotated images. The questions to ask are:
- How often do you think you get that kind of data (clean, ready, nicely formatted, with
client being responsive and supporting you)?
- In which ways can you bring more value to your employer by helping get
things done, often drawing on your previous experience, work, and code in a
domain of interest?
- How can you act as a lever for other team members?
- How can you act as a bridge between stakeholders and do impedance matching to
increase effectiveness of the whole system?
- How do you feel about "business" (basic econ, ops management, marketing,
accounting, etc.)? It helps transduce features/bug fixes/refactoring to
business terms stakeholders understand.
- How can you move obstacles out of the way, however small or boulder-sized they may be?
Some things I have found useful:
- Maintain a network of interesting and smart people in different domains
(physicians and physicists, chemists, poets, painters, musicians, engineers, etc.).
- Reading a lot about a lot.
- Implementing stuff. Getting HTTP 429 and knowing what to do about it. Experimenting. Documenting.
- Helping others be better at what they do, do it better and more profitably.
Connecting people and wanting them to succeed.
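On the HTTP 429 point above: one common response is to wait and retry, honoring the server's `Retry-After` header when it is present and backing off exponentially otherwise. The sketch below shows that idea with the standard library only. The function name and defaults are assumptions for illustration, and it only handles `Retry-After` given in seconds, not as a date.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def fetch_with_backoff(
    url: str,
    open_url: Callable = urllib.request.urlopen,
    sleep: Callable[[float], None] = time.sleep,
    max_attempts: int = 5,
    base_delay: float = 1.0,
) -> bytes:
    """GET a URL, retrying when the server answers 429 Too Many Requests."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            with open_url(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Give up on any other status, or once the attempts are used up.
            if err.code != 429 or attempt == max_attempts - 1:
                raise
            retry_after = err.headers.get("Retry-After") if err.headers else None
            if retry_after is not None and retry_after.isdigit():
                delay = float(retry_after)  # the server said how long to wait
            else:
                delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ...
            sleep(delay)
    # Not reached: the loop above always returns or raises.
    raise RuntimeError("retry loop exited unexpectedly")
```

`open_url` and `sleep` are parameters only so the retry logic can be exercised in a test without real network calls or real waiting.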
Now, if I see that a candidate can hustle, I'd be very interested. I can count such candidates on one finger, and the kid was snatched up faster than I could get to him (and was snatched up by an acquaintance working at a top institution, with a sorry-not-sorry).