Ask HN: What problem are you close to solving and how can we help?

jordwest · on Aug 29, 2021

I want to bring back old school distributed forum communities but modernise them in a way that respects attention and isn’t a notification factory.

Mastodon is a pretty inspirational project but the Twitter influence shows, I miss the long form writing that was encouraged before our attention spans were eroded.

Not at all close to solving it, but it’s been on my mind for a long time. Would love to hear if there are others like me out there and what you imagine such a community to look like.

101008 · on Aug 29, 2021

Is this a software problem? There are a lot of open sources (with their differences) platforms. From old forums, to more modern, etc. I think the problem is that people don't want or don't know how to use a forum.

This is based in my experience:

- Old people: They started to use internet recently so they are used to social networks (Facebook, Instagram), and newspapers websites.

- Young people: Hard to make them use the browser, if there isn't an app (Instagram, TikTok), you are lost. If they want to discuss a topic, mostly Twitter through hashtags, YouTubers or Discord.

- Adult people: This is where some of them may use a forum, but you have to be lucky enough to find adult people who use internet many years ago, and they know what is a forum. If you find a 30 or 40 years old who started to use internet 5 years ago (which happens), you are lost.

And on top of that, you need to compete against Reddit and their own subreddits.

(Edited to format)

koheripbal · on Aug 29, 2021

The value of a forum is not to get everyone. It is actually in limiting a community to the most mature voices and thoughtful people.

There's little upside to having teenagers in a forum, for example, unless you're looking to monetize.

TheAceOfHearts · on Aug 29, 2021

Younger people can help evolve topics by providing a set of fresh eyes. It's also important to pass knowledge on, otherwise it just ends up dying with you.

For example: Virtual reality headsets were fairly stagnant until a young guy in his early 20s tried something new.

Old people often become stuck in their ways and it requires someone new to show up and ask how the sausage is made.

koheripbal · on Aug 30, 2021

"fresh eyes" can still be the people new to the form who have that year just turned 35 (eg).

wizzwizz4 · on Aug 29, 2021

I was moderating forums when I was twelve. While I experience the usual “I was so embarrassing back then” when thinking back to those times, I do think my contributions were positive and well-received.

pjlegato · on Aug 29, 2021

I used many old school forums in the pre-social media days. "Mature voices and thoughtful people" were quite rare.

easrng · on Aug 30, 2021

If you think young people won't use forums, you clearly haven't seen Scratch's. (https://scratch.mit.edu/discuss/)

csydas · on Aug 29, 2021

>- Young people: Hard to make them use the browser, if there isn't an app (Instagram, TikTok), you are lost. If they want to discuss a topic, mostly Twitter through hashtags, YouTubers or Discord.

I disagree on this take.

GenZ can do long-form discussion and they use forums frequently, but for specific discussion(s).

Content consumption is done via forever-scroll apps because it's good to kill time; Tiktok, Instagram, etc match this well because it's just a stream of things to have fun with.

But GenZ is making great "longford" content on many traditional platforms, even blogging. The difference, as I understand it, is their approach to engagement. GenZ has seen Facebook arguments and said "no thank you". Twitter discourse also isn't really a thing -- people disagree and Tweet at people, but Twitter isn't like a forum thread and there's no way to ensure your content is associated with the content you want to respond to, so it's not effective to communicate in responses. Reddit is kind of a mystery for me as I just don't frequent it at all, but I don't get the impression that GenZ is posting frequently.

GenZ has platforms that work best when you make a statement, not where you open a discourse; Twitter is too fast/broad to respond to all comments and find real content to respond to, video content just isn't great for back and forth and becomes time consuming for lighter topics. Viewers will make whatever statement they want, but the validity of the video is based on how far a concept is spread; I'd actually position GenZ is very good at concisely expressing its idea in a simple and condensed format, and responses are not at an individual, but an idea. But, they will go to Tumblr/Medium/other long-form post when the medium is appropriate. This is one thing I like about a lot of GenZ content because they tend to be VERY good about choosing the right format for their argument; that many don't have much more to express besides tweets/tiktoks isn't an indictment of GenZ, it's praise. Think of the forums you maybe still lurk around and how many comments are just complete garbage/non-sequiturs; for forums such a post are enough to derail a topic/distract because we feel obligated to a degree to respond, but filtering noise is part of the skill of using more modern platforms.

Forums are kind of contrary to this, and also are bogged down by the before-mentioned Facebook-argument issue and the preferred platforms not really being strongest for direct rebuttals. Again, there are times when you'll see GenZ use forums or other long-form posts, but it tends to be more controlled or on 'forums-but-not-really' forums like Tumblr.

Forums have their purpose and use; but we have __many__ alternatives that make forums defunct, as some topics/ideas are far better expressed on Tiktok or Twitter, and whatever followup we end up with.

inamiyar · on Aug 29, 2021

this vaguely aligns with my experiences

lonk11 · on Aug 29, 2021

I am building a project (https://linklonk.com) that does information discovery in a way that respects your attention. In short, when you upvote content you connect to other people upvoted that content and to feeds that posted that content. So to get your attention other users need to prove to be a good curator of content for you.

I'm planning to do a "Show HN" post next week for it and would appreciate any feedback that I could address before it. We have about 4 active users and a couple more would be great.

dmje · on Aug 29, 2021

Really like this idea, something I've been thinking about for a while. Will join and give the tyres a kick :-)

lonk11 · on Aug 29, 2021

Thanks! It looks like 13 people signed up, which is really encouraging.

I wrote about the performance tuning I did in preparation to the Shown HN post: https://linklonk.com/item/277645707356438528

If you have any feedback please add a comment to that post.

dmje · on Aug 30, 2021

Hey, site down?

lonk11 · on Aug 30, 2021

Yes, I managed to delete the "A" DNS record last night when I was adding records for mail hosting linklonk.com. Sorry! It is up now.

lonk11 · on Sept 3, 2021

The "Show HN" post is up: https://news.ycombinator.com/item?id=28405643

sthatipamala · on Aug 31, 2021

Hmm... did you happen to work at Google? The concept and the UI seems very familiar.

wizzwizz4 · on Aug 29, 2021

Bug report: the UI on interacting with posts in the “From feeds and users that recommended this” section is broken.

lonk11 · on Aug 29, 2021

Thank you! Indeed, the upvote button on those item-based recommendations didn't work. It is fixed now.

ammar_x · on Aug 29, 2021

Seems good but the UI can be improved IMO.

I have two questions please:

- Is there a list of all the feeds used by LinkLonk?

- How many users are there today?

lonk11 · on Aug 29, 2021

Do you have any specific suggestions on how to UI? Small tweaks would be most appreciated as I could implement them before the upcoming Show HN post.

To answer your questions:

1. The list of all feeds used by LinkLonk is not publicly accessible through the website. They are feeds that users explicitly submitted through https://linklonk.com/submit or feeds that LinkLonk parsed from the meta tags of the links that users submitted.

2. The number of active users has been about 4 for the last few months. I'm hoping to get it to 10 this year.

wizzwizz4 · on Aug 29, 2021

When and how are you planning to change your Terms of Use / Privacy Policy? That kind of information should be in the documents. Your privacy policy is currently insufficient.

> We only collect your information for the purpose of providing this service.

Okay, but what information do you collect? “Your information” is too broad; if you're collecting my retinal scans or matching me to a behavioural analytics profile, you need to justify it. If you're not (which you're not), say so!

> You can delete your account and all your data at any time (see Profile).

But can I take all my data out? Currently, no; that's not a GDPR violation, so long as you provide it on request, but it's certainly a feature I like to have!

The 30-day deletion threshold for anonymous accounts is (IANALaTINLA) GDPR-compliant, since you have to delete personal information (on request) within 30 days, and that happens even if you can't figure out whose account details you should be deleting. Good job.

lonk11 · on Aug 29, 2021

I share your concerns about user privacy. The information LinkLonk collects is what you explicitly provide (ratings, etc) and the regular server request logs (which include your ip address, user agent). I clarified this in https://linklonk.com/privacy

I do want to add functionality to download your ratings. I'm thinking of exporting the ratings data in either the bookmarks format (ie, the format that browsers use to export bookmarks: https://support.mozilla.org/en-US/questions/1319392), csv or json. Please let me know what format would be most useful.

wizzwizz4 · on Aug 30, 2021

Yeah, that's great now. (You're collecting even less than I expected!) Thanks for making LinkLonk.

JSON would probably be easiest to start with, because it's easy to generate, easy to read and well-defined. Bookmarks would be a nice extra, though.

jaxx75 · on Aug 29, 2021

Looking forward to see where you go. I used forums primarily for my automotive hobbies, but they all seemed to have died around 2013-2015 as Facebook groups took over. Still, forums are often the best place to find good information. I worry about the day that Facebook solves the search and weekly "repeat topic" problems that are the only thing holding it back.

It's a shame phpbb, vbullettin and other big players in the space were too slow to adapt to mobile.

OJFord · on Aug 29, 2021

> It's a shame phpbb, vbullettin and other big players in the space were too slow to adapt to mobile.

Was that a problem? My memory is that everything used 'Tapatalk' for mobile, perhaps before the first iPhone even (I recall using it on an iPod Touch).

samtho · on Aug 29, 2021

I used to volunteer my time to a very major forum software. Tapatalk at the time had a very strange business model of having a plug-in/add-on that was free to the forum owner but charged for the app that the end user used. This was, even at the time, traditionally backwards from the current user interaction model: path of least resistance to engagement. It was unpopular among forum admins as they would rather buy a license for it like you would with vBulletin, Xenforo, Invision Power Board - including the people who ran open source ones like phpBB, SMF, et al.

While I understand Tapatalk have changed their business model since then, the damage was already done as the Facebook started to wholesale eat forums’ lunch as far as userbase. The biggest problem is that we never turned forum interactions into a protocol like we did with at the application level (smtp/pop, http, irc, xmpp) or on top of http like RSS, podcasts, or just plain, standardized REST APIs. This could have enabled multiple clients (like browsers) to appear and may have prevented facebooks swift dominance with online communities.

Everyone wanted to own their forum’s experience, but this stubbornness caused the fiction for users to sign up to be greater and greater. Platforms like Disqus attempted to solve this by creating an embeddable service to just drop comments in a context like a blog post, but this ultimately gave users almost no value it they were just in a shouting match against bots with generic messages laden with spam.

Facebook unified the experience for users, where the user could, with an account+app that they already had, browse and join groups and engage in discussions and become apart of communities in a way that forums could not possibly compete with.

OJFord · on Aug 29, 2021

Oh yes! I'd forgotten that aspect. Was there not anything you could do for free?

I wasn't involved with the hosting/software/ops etc. side of it at all, but I moderated 'The Computer Forum', latterly 'Computer Juice', and used it mainly with that. I fondly remember wasting an awful lot of time helping people solve Windows problems (haven't used it since.. not saying that's related..) and spec new builds.

I suppose that's all happening on StackExchange and probably some DIY custom pc Discord server or whatever these days.

eropple · on Aug 29, 2021

It was, and it was pretty terrible.

kodablah · on Aug 29, 2021

I have spent a good bit of time thinking here too, on two primary parts.

First, in my mind the difference between reddit, FB, twitter, HN, forums, etc is really just configuration. Abstracting just a tad bit higher, you can include Slack and other realtime options. I just want a curated gRPC API that implements it with pluggable auth and let others figure out discoverability and network (not activity api and not with an already built network, just persistence and auth). End to end encryption is important IMO too (even for large groups) so the host can have plausible deniability.

Second, you have to solve for hosting/network in distributed fashion without over-complicating server democratization and discoverability/naming like p2p often does. I see 2 options: 1) self hosted on at-home workstation using Tor onion service (get NAT busting for free) knowing you need an offline friendly implementation or 2) one-click easy reselling of cloud instances and domains from inside the app (this also provides a funding model).

I know of many p2p options for solving these problems, but I think we dont need to complicate at that level. As for the quality of the communities themselves, a self-hosted megaphone instead of perverse share/like broadcast incentives of companies today will automatically improve discourse (at a potential cost of creating echo chambers).

fartingflamingo · on Aug 29, 2021

> First, in my mind the difference between reddit, FB, twitter, HN, forums, etc is really just configuration.

The big boys are dopamine reinforced network effects incorporated by means of software tech. So forget about aiming for Homer Simpson.

You'll have to be happy with a small, but productive minority. Enough valuable people would rather die than use FB for niche purpose X. Start by convincing them...

spurgu · on Sept 5, 2021

A bit late to the party but I've been thinking about this too.

One of the main problems with platforms like FB and Reddit is that the posts/discussions are shortlived. They bubble up to the top in the feed when they're fresh and active but then die off and are replaced by the next thing that craves your attention.

Forum posts are sorted in chronological order, grouped into categories. Browsing through the feed you can see what topics are being actively discussed, or you can search for past discussions on a topic you like, and resurrect and old thread if you find a good one, and it gets a new life. I like this model.

One concept I've thought about is something like Reddit, where anyone can create and manage subs, but which wouldn't have the same kind of karma whoring and short attention span issues, i.e. posts/threads would live forever and not get locked after 6 months or whatever Reddit does, and posts would be sorted by activity rather than ADD points. I've found so many good X years old Reddit threads with super interesting discussions which I would've liked to jump in and resurrect but can't.

Of course an immediate problem that comes to mind is spam. If posts with new comments are lifted to the top it invites spam bumps and thus moderation. And/or it could be combined with some sort of karma system (reputation, account age etc).

And, since this wouldn't be a specialized forum you'd need to make sure you could cater to various different kinds of communities, i.e. have good multimedia sharing capabilities (for those communities only wanting to share images/videos/memes), code formatting and syntax highlighting, maybe LaTex etc...

6510 · on Aug 29, 2021

I see a great blog one time where the author was writing a book in the first post. All following postings were dressed up change logs with many interesting and useful comments.

You don't want the notification factory but if its a single post that gets updated you don't get any feed updates at all. Reading the same book again and again looking for the updated sections is also not much fun. Dropping a comment in the long list of comments under it doesn't really create a discussion. (specially not without feed updates)

You could design the publishing tools so that it "forces" the user into that pattern. People could work on multiple books or long reads but start with a crappy draft or just a bunch of links.

wizzwizz4 · on Aug 29, 2021

I'm writing a crappy book, and was thinking of using Pijul (https://pijul.org/) to do exactly this! I'd like a better solution, though.

flir · on Aug 29, 2021

The people I want to talk to are on facebook groups (my hobbies seem to be "old people" hobbies). I think you're suffering from network effects.

Sooooo..... StackOverflow solved it by starting with a vertical the founders had a lot of social juice in, and spreading in to other verticals. Possibly also by focusing very tightly on "questions and answers".

So my suggestion is "overfocus". If the big platforms have a weakness, it's that they're generic one-size-fits-all solutions. Solve one problem (Q&A, show off my project, discussion, news aggregator) for one vertical really well, then expand. An example off the top of my head might be collaborative note-taking for a class. 6510's "write a book in public" platform is also a fantastic idea at first glance.

(But skip the distributed bit - customers don't care about your architecture, and whatever USP a distributed platform has can be emulated by a centralised platform. Centralised always wins).

jordwest · on Aug 30, 2021

Great point about starting with a specific vertical. Creator communities (youtubers etc) is an area I had been thinking to focus on, though this space is mostly dominated by Discord at the moment.

> Centralised always wins

My dream isn’t necessarily to win in a financial/monopolistic sense, but rather to build a compelling enough alternative to the centralised systems that have lost their way thanks to incentives that aren’t aligned with the community.

Facebook, reddit, disqus all started out with good intentions to connect people, but have been slowly eroded by incentives to suck user attention.

So it may not the best business strategy, but I think such software should live or die on whether the community enjoys using it and is willing to (financially) support the continued existence, rather than how much attention can be siphoned into ads.

In other words, small niche communities where a few members don’t mind contributing financially rather than huge communities that rely on network effects and centralisation.

flir · on Sept 1, 2021

I've thoguht about this for a couple of days, and I want you to understand I'm coming from a place of kindness - I want you to succeed.

Ok, so. The problem you're trying to solve is "build a community that is prepared to contribute financially to the running of the site" (correct me if I'm wrong).

Distributed is one possible solution to this problem. You're in love with that solution and its time to murder your darlings. Sit down, brainstorm five other possible solutions, and honestly assess which one solves your problem best.

I think you'll have a hard time beating a subscription model.

ransom1538 · on Aug 29, 2021

I worked in forums for 4 years. It is the worst most unprofitable business you can get into. The best you can do is show low quality ads or use crappy affiliate programs. The more ads used the less the users like the forum. Forum users love being anonymous. So you provide no value to advertisers that want age,sex,location. Advertisers also HATE forums due to their ad being shown on user generated content. Drama: There are always personal attacks on moderators, posting of illegal material, threats, police involvement and constant human curation. If you build centralized forum software every forum owner works day and night to use something cheaper and get away from your control. Don't even bother offering host your own software (non centralized). "OH i know React I can build a vBulletin clone." No. No you cannot. vBulletin has been working on this software for 20 years and they offer it for $300.

minhmeoke · on Aug 29, 2021

One idea is to use some more obscure/techie protocol like Gemini [1] to create a self-selecting group of bloggers that are drawn to and choose to participate in the community, and at the same time keep spammers, commercial interests, and other unwanted influences out.

[1] https://gemini.circumlunar.space/ Earlier discussion at: https://news.ycombinator.com/item?id=23042424

ted0_2021 · on Aug 29, 2021

I'm very interested in this space too. I want to start a project that explores different ways to communicate on the web. The current state of Facebook and Twitter are not the best way in my opinion.

codingclaws · on Aug 30, 2021

You may be interested in my open source forum:

https://github.com/ferg1e/peaches-n-stink

It's basically an experimental communication platform. Right now I am building Internet forum style communication but I want to expand to other communication mechanisms later.

buro9 · on Aug 29, 2021

I tried this, got quite far. Will go a little further when I have spare cycles for it.

Example: https://www.lfgss.com based on code you'll find on GitHub under microcosm-cc

dtjohnnymonkey · on Aug 31, 2021

Reddit is building something like this https://www.reddit.com/community-points/

MaxBarraclough · on Aug 29, 2021

From your summary I'm not sure I understand what exactly your project sets out to do. People are still able to run their own independent blogs, after all. Are you thinking of blogging federation of some sort?

birbs · on Aug 29, 2021

I think Discord is then modern form of this. It’s really a great product.

wintermutestwin · on Aug 29, 2021

Isn't Discord just chat (aka IRC)? Every time I try to get on Discord, it seems chaotic and confusing. Like chat, I guess it depends on who the users are at the moment you happen to use it. Forums are a much better approach to information sharing IMO.

birbs · on Aug 30, 2021

Discord is chat in a sense. But community forums are also a form of chat. You shouldn’t confuse a technical implementation (e.g. Discord or IRC) with the end product (e.g. fostering community discussion).

paulsutter · on Aug 29, 2021

Be sure to add cryptographic signatures on the postings and up/downvotes (also with signatures). Then many people can develop content ranking and blocking, and who knows someone may get it right

wazoox · on Aug 29, 2021

Did you try Diaspora?

davehcker · on Aug 29, 2021

I have posted this here before- hexafarms.com. I am trying to use ML to discover optimal phenotype for growing plants in vertical indoor farms to a. have the higest quality produce b. to lower the cost of producing leafy green/med plants, etc. within cities itself.

Basically, every leafy green (and herbs, and even mushrooms), can grow in a range of climatic condition (phenotype, roughly) ie temperature, humidity, water, CO2 level, pH, light (spectrum, duration and intensity) etc. As you might have seen around the world there is a rise in indoor vertical farms, but the truth is that 50% of those are not even profitable. My startup wants to discover the optimal parameters for each plant grown in our indoor vertical farm and eventually I would let our AI system control everything (something like alphaGo, but for growing plant X (lettuce, kale, chard, ). Think of it as reinforcement learning with live plants! I am betting on the fact that our startup will discover the 'plant recipes' and figure out the optimal parameters for the produce that we would grow. Then, the goal is that cities can grow food cheaper in more secure and sustainable way than our 'outsourced' approach in country side or far away lands.

So now I have secured some funding to be able to start working on optimizations, but I realized that *hardware* startups are such a different kind of beast (I am a good software product dev though, I think). Honestly, if anyone with experience in hardware related startups (or experience in the kind of venture I am in) would just want to meet me and advise me, I would take it any day. Being the star of the show, it's hard for me to handle market segmentation, tech dev, team, next round of funding, European tech landscape, etc. I am foreseeing so many ways that our decisions can kill my startup, all I need is advise from someone qualified/experienced enough. My email: david[at]hexafarms.com

spiralganglion · on Aug 29, 2021

Reminder to focus on nutritive content, flavour, and crop diversity, not just yield. The past 100 years of industrial scale agriculture, with the singular goal of maximizing yields, has done incredible harm. (This has come up on HN repeatedly, so I trust you've seen it, but it's worth championing)

dehrmann · on Aug 29, 2021

> incredible harm

I agree that micronutrient content has decreased in the past century. Some might be because of scale, some might be that yield gains are mostly driven by macronutrients and water, not micronutrients, it could be selecting varieties that taste better, or it could be depleting the soil.

That said, the US has an obesity epidemic, so there's no shortage of macronutrients. Macronutrient shortages also seem rare. Scurvy and rickets aren't exactly problems.

apinstein · on Aug 29, 2021

This isn’t an answer to your ML question, but it is an answer to your problem.

I heard about a greenhouse company that has programmed their climate control to match “best growing conditions historical weather”. So, they ask local experts what year / location had the best X and then they use that region’s historical weather and replay it in their greenhouse. I thought that was brilliant!

(Just realized this was Kimbal Musk that mentioned this)

eitland · on Aug 29, 2021

When I studied farming back in 1998-1999 we once visited a greenhouse and one interesting thing I picked up was that by observation some gardeners had realized that lowering the temperature a bit extra an hour or two before sunrise they could get their flowers to be more compact instead of stretching.

This had replaced shortening hormones in modern gardening (or at least at that greenhouse, but my understanding they were just doing the same thing as everyone else).

I guess there is a lot more to learn for those who have scale enough to experiment and patience to follow through.

universa1 · on Aug 29, 2021

Hmm,

Sounds similar to what I read a long time ago about a big tomato farm in the Netherlands... Have you tried talking to actual farmers of that produce? Universities? Agricultural faculties do a lot of research in that direction.

Expensive, quickly perishable produce might be able to compete, otherwise I guess free water and energy from above in the "remote" classical farming will be hard to beat.

And then my naive guess would be that to generate enough data for a "ml" approach not only by name might be somewhat expensive.

This sounds so negative, but this is not my intention... I wish you all the best and hopefully will stumble upon a success story in the future :-)

3pt14159 · on Aug 29, 2021

I know this isn't going to sound as sexy as AlphaGo for plants, but I really think this is a classic multilinear optimization problem once you've properly labeled the data and defined the dynamics between the plants / other organisms (e.g., aquaponics). You're looking to optimize multiple variables across a set of known constraints and I think if you properly defined these constraints you could save a lot of headache / buildout by leveraging a pre-exsting toolset like Excel with the Excel Solver add-in an a couple hundred user defined functions. We're talking 1% of the work to get something useable and product-market-fitable with automatic output of graphs, etc, that clients could tune and play with locally without you needing to actually share the source sauce. Eventually you could switch to Python for something more dynamic / web based.

flir · on Aug 29, 2021

Yeah, the description made me think "simulated annealing" not "AI". I mean, even genetic algorithms might be overkill here.

mattmanser · on Aug 29, 2021

I'm not able to help, but you don't have any contact details listed against your profile or in this post. How is anyone able to contact you?

At the very least what's a link to your startup's website?

davehcker · on Aug 29, 2021

Sorry I thought my email was on my HN profile. I am sitting behind david[at]hexafarms.com

sokoloff · on Aug 29, 2021

If you’ve listed it in the email field, that’s accessible to HN admins, but not users.

If you want users to have it from your profile, put it in the “about” field.

mattmanser · on Aug 29, 2021

It sounds like an interesting project, good luck and I hope someone reaches out!

mkolodny · on Aug 29, 2021

There's some great research on using evolutionary computation to explore plant growing recipes (light strength, how long to leave the lights on, etc). In one experiment, researchers discovered that basil doesn't need to sleep - it grows best with 24 hours of light per day. Risto Miikkulainen shared the experiment on Lex Fridman's podcast: https://youtu.be/CY_LEa9xQtg?t=27m7s I believe this is the paper describing that experiment: https://journals.plos.org/plosone/article?id=10.1371/journal...

bravura · on Aug 29, 2021

This sort of ml problem is characterized by relatively expensive data labeling. Hence, hiring an expert or mixture of experts, and modeling the crop responses to their choices, will save you a lot of hill climbing The wrong part of the decision space

exdsq · on Aug 29, 2021

That sounds awesome. I’d love to work in this field. Any tips on where you learn this stuff? Currently a software dev in crypto.

qPM9l3XJrF · on Sept 1, 2021

I think you'd be better off using a Gaussian process than reinforcement learning

ampdepolymerase · on Aug 29, 2021

You need to sequence the plants otherwise you will waste too much time on tuning hyperparameters.

Rodeoclash · on Aug 29, 2021

I'm not sure if this is in the spirit of the thread but I've been working on a way to allow reviews of gameplay in video games. In short, you upload a video of you playing the game and someone who's an expert can review it.

I currently have a UI with the comments down the side of the screen which looks like this:

https://www.volt.school/videos/c980297a-417b-416f-947b-58a70...

This is good because you can easily: - See all the comments - Navigate between them - See replies etc.

However it has a huge problem with you trying to balance watching the video with reading the comments.

I also have an alternative UI I've been working on which only shows one comment at a time:

https://www.volt.school/videos-v2/c980297a-417b-416f-947b-58...

However the downsides of this is that you can't see all the comments at once. I'm not a UI/UX designer AT ALL so I'd really appreciate some pointers around how to think about making this better! The original post mentions "close to solving", I think I am pretty close but it's still not quite right and while I'm not out of ideas yet, I'd appreciate feedback if solving this is obvious to someone else.

marshallbananas · on Aug 29, 2021

How about showing the comments directly on the video, at a specific time, and a specific place.

Something like Soundcloud comments but for video.

Asian video platforms used to do that. Here's an example: https://www.youtube.com/watch?v=hOMMQmYwd4I

It's totally crazy but can be made much more coherent. Would be useful to have comments at a specific time and place on the screen just for very accurate pointers/comments.

DougBTX · on Aug 29, 2021

So maybe the problem with showing all the comments at once is that there are too many, and when showing one at a time they are not shown for long enough.

How about breaking the play into chapters/zones/rooms/segments (whatever makes sense for the game) then showing all the comments for that segment. Once the segment ends, there would be a replay button if they missed anything on the first play while reading comments, and a next segment button to carry on.

Interesting time spans could be marked for slow motion, boring bits played at double time.

There would be high level navigation between segments with thumbnails and comment counts. Buttons to skip between “pivotal moments”, maybe with voting to highlight them.

sundarurfriend · on Aug 29, 2021

The default should be to show one comment at a time, because that's convenient and quick to get into, but also with an option (maybe just a scroll down) to view all comments. One, that helps the reviewer get an overall idea of what kind of things the submitter is looking for, if they want that, and two, some submitters are inevitably gonna screw up, posting at the wrong times or asking overall summary questions that should be asked at the end right at the beginning or somesuch. So an All Questions button or similar should be there as an escape hatch, but not the primary UI.

f0e4c2f7 · on Aug 29, 2021

I find that my normal model for reading comments on videos across platforms is to not read them much of the time, but if it's a really interesting video go look at the comments and it's ok if the video is for example, fully minimized or off screen etc while I read them.

I don't know how normal my use is though or if that's at all helpful.

peripitea · on Aug 29, 2021

Don't have anything to add right now but I like the idea of this thread and would support it becoming a regular thing.

k1rcher · on Aug 29, 2021

We are having atrocious READ/WRITE latency with our PG database (api layer is django rest framework). The table that is the issue consists of multiple JSON BLOB fields, with quite a bit of data— I am convinced these need to be abstracted to their own relational tables. Is this a sound solution? I believe it is the deserialization in these fields of large nested JSON BLOBS that is causing latency. Note: this database architecture was created by a contractor. There is no indexing or relations existing in current schema. Just a single “Videos” table with all metadata stored as Postgres JSON field type blobs. EDIT: rebuilding the schema from the ground up with 5-6GB of data in the production database (not much, but still at the production level) is a hard sell, but I think it is necessary as we will be scaling enormously very soon. When I say rebuild, I mean a proper relational table layout with indexing, fk’s, etc.

EDIT2: to further comment on current table architecture, we have 3-4 other tables with minimal fields (3-4 Boolean/Char fields) that are relationally linked back the Videos table with a char field ‘video_id’, that is unique on the Videos table. Again, not a proper foreign key so no indexing.

sgk284 · on Aug 29, 2021

Are you just doing primary key lookups? If so, a new index won’t do much as Postgres already has you covered there.

If you have any foreign key columns, add indexes on them. And if you’re doing any joins, make sure the criteria have indexes.

Similarly, if you’re filtering on any of the nested JSON fields, index them directly.

This alone may be sufficient for your perf problems.

If it isn’t, then here’s some tips for the blobs.

The JSON blobs are likely already being stored in TOAST storage, so moving them to a new table might help (e.g. if you’re blindly selecting all the columns on the table) but won’t do much if you actually need to return the JSON with every query.

If you don’t need to index into the JSON, I’d consider storing them in a blob store (like S3). There are trade offs here, such as your API layer will need to read from multiple data sources, but you’ll get some nice scaling benefits here and your DB will just need to store a reference to the blob.

If your JSON blobs have a schema that you control, deprecate the blobs and break them out into explicit tables with explicit types and a proper normalized schema. Once you’ve got a properly normalized schema, you can opt-in to denormalization as needed (leveraging triggers to invalidate and update them, if needed), but I’m betting you won’t need to do any denorm’ing if you have the correct indexes here.

And since you have an API layer, ideally you’ve also already considered a caching layer in front of your DB calls, if you don’t have one yet.

k1rcher · on Aug 29, 2021

This is super interesting stuff.

First of all, I think the caching layer (which we currently don’t have) is going to be a necessity in the coming weeks as we scale for an additional project (that will be relying on this architecture)

Second of all, it is just PK lookups. We don’t actually have a single fk (contractor did not set up any relations), which makes me think moving all of this replicated JSON data from fields to tables may help.

The queries that are currently causing issues are not filtering out any data but returning entire records. In ORM terms, it is Video.objects.all(), and from a URL param in our GET to the api, limiting the amount of entries returned. What’s interesting is this latency scales linearly, and at the point we ask for ~50 records we hit the maximum raw memory alloc for PG (1GB) causing the entire app to crash.

The solution you propose for s3 blob store is enormously fascinating. The one thing I’d mention is these JSON fields on the Video table have a defined schema that is replicated for each Video record (this is video/sensor metadata, including stuff like gps coords, temperature, and a lot more).

So retrieving a Video record will retrieve those JSON fields, but not just the values: the entire nested BLOB. And does so for each and every record if we are fetching >1

Would defining this schema with something like Marshmallow/JSON-Schema be a good idea when you mention JSON schemas we control? As well as explicitly migrating those JSON fields to their own tables, replaced with an FK on the Video table?

sgk284 · on Aug 29, 2021

I do want to emphasize that the S3 approach has a lot of trade offs worth considering. There is something really nice about having all of your data in one place (transactions, backups, indexing, etc... all become trivial), and you lose that with the S3 approach. BUT in a lot of cases, splitting out blobs is fine. Just treat them as immutable, and write them to S3 first before committing your DB transaction to help ensure consistency.

Regarding JSON schema, if you have a Marshmallow schema or similar, yes that’s a wonderful starting point. This should map pretty closely to your DB schema (but may not be 1-to-1, as not every field in your DB will be needed in your API).

I’d suggest avoiding storing JSON at all in the DB unless you’re storing JSON that you don’t control.

For example, if the JSON you’re storing today has a nested object of GPS coords, temperature, etc.. make that an explicit table (or tables) as needed. The benefits are many: indexing the data becomes easier, the data is stored more efficiently, the table will take up less storage, the columns are validated for you, you can choose to return a subset of the data, etc… You will not regret it.

geitir · on Aug 29, 2021

Unrelated to post, but as you seem well informed in the field, would you agree that if a schema is not likely to change and is controlled as you put it, there is no reason to attempt to store that data as denormalized document?

Or at least as you suggest if required for performance the data would still be stored denormalized and where needed materialized / document-ized?

At my current company, there seems to be a belief that everything should be moved to mongo / cosmo (as document store) for performance reasons and moved away from sql sever. But really I think the issue is the code is using an in house orm that requires code generation for schema changes and probably less than ideal performance query generation.

But then I am also aware of the ease of horizontal scaling with the more nosql orientated products, and trying to be aware of my bias as someone who did not write the original code base.

sgk284 · on Aug 30, 2021

> would you agree that if a schema is not likely to change and is controlled as you put it, there is no reason to attempt to store that data as denormalized document

As a general rule of thumb, yes. Starting with denormalization often opens you up to all sorts of data consistency issues and data anomalies.

I like how the first sentence of the Wikipedia page on denormalization frames it (https://en.wikipedia.org/wiki/Denormalization):

> Denormalization is a strategy used on a previously-normalized database to increase performance.

The nice thing about starting with a normalized schema and then materializing denormalized views from it is that you always have a reliable source of truth to fall back on (and you'll appreciate that, on a long enough timeline).

You also tend to get better data validation, reference consistency, type checking, and data compactness with a lot less effort. That is, it comes built into the DB rather than introducing some additional framework or serialization library into your application layer.

I guess it's worth noting that denormalized data and document-oriented data aren't strictly the same, but they tend to be used in similar contexts with similar patterns and trade-offs (you could, however, have normalized data stored as documents).

Typically I suggest you start by caching your API responses. Possibly breaking up one API response into multiple cache entries, along what would be document boundaries. Denormalized documents are, in a certain lens, basically cache entries with an infinite TTL... so it's good to just start by thinking of it as a cache. And if you give them a TTL, then at least when you get inconsistencies, or need to make a massive migration, you just have to wait a little bit and the data corrects itself for "free".

Also, there are really great horizontally scalable caching solutions out there and they have very simple interfaces.

geitir · on Aug 30, 2021

Thanks for your response. The comparison between infinite ttl cache entries and a denormalized doc is an insight I can't say I've had before and makes intuitive sense

machiaweliczny · on Aug 29, 2021

Doesn't postgress have a way to index JSONB if needed?

sgk284 · on Aug 29, 2021

You can index on fields in JSONB, but I don’t believe that’s what the op is solving for here.

In either scenario, I’d still generally encourage avoiding storing JSON(B) unless there isn’t a better alternative. There are a lot of maintenance, size, I/O, and validation disadvantages to using JSON in the DB.

taesu · on Aug 29, 2021

imo, json dt should be an intermidiary step in db struc in rdb, never the final. Once you know & have stable columes, unravel the json into proper cols with indexing, it should improve the situation

if youre having issues with 5gb, you will face exponential problems when it grows due to lack of indexing

k1rcher · on Aug 29, 2021

Cheers for the response (and affirmation). After some latency profiling I am convinced proper cols with indexing will vastly improve our situation since the queries themselves are very simple.

esemathew · on Aug 29, 2021

Depending on how much of the data in your json payload is required, extract data into their own table/cols. And store the full payload in a file system/cloud storage.

machiaweliczny · on Aug 29, 2021

Also there's a way to profile which queries take longest via DB itself and then just run EXPLAIN ANALYZE to figure what's wrong

taesu · on Aug 29, 2021

You can take incremental approach and p.o.c with the data that you have so you can justify your move too!

ufmace · on Aug 29, 2021

I don't think the latency issues are necessarily related to the poor schema. I'd say to dig into the query planning for your current queries and figure out what's actually slow, since it may not be what you expect.

Rearchitecting the schema might be worth doing. From the technical side, PG is pretty nice about doing transactional schema changes. I'd be more worried about the data though. Are you sure that every single row's Json columns have the keys and value types that you expect? Usually in this type of database, some records will be weird in unexpected ways. You'll need to find and account for them all before you can migrate over to a stricter schema. And do any of them have extra unexpected data?

101008 · on Aug 29, 2021

I had to move a MongoDB to PG database on my new job (old contractor created MVP, I was hired to be the "CTO" of this new startup) and I had some problems at first, but after I created the related models and added indexes, everything worked fine.

As someone said, indexes are the best to do lookups. Remember your DB engine do lookups internally even if you are not aware (joins, for example), so add indexes to join fields.

Another thing that worked me for me (and I dont know if it's your case), was to add trigram text indexes, which make it faster to do a full text search. Remember, anyway, that adding a index makes search faster, but insert slower, so be careful if you are inserting a lot of data.

mamcx · on Aug 29, 2021

Other tips:

- Change the field type from JSON to JSONB (better storage and the rest) https://www.postgresql.org/docs/13/datatype-json.html

-Learn the json in-build functions and see if one of them can replace one you made ad-hoc

- Seriously, replace json with normal tables for the most common stuff. That alone will speed up things massively. Maybe keep the old json around in case, but remove when it become old(?)

- Use views. Views allow to abstract over your databaee and allow to change internals

- If a big things is searching and that searching is nkind of complex/flexible, add FTS with proper indexing to your json then use it as first filter layer:

    SELECT .. FROM table WHERE id IN (SELECT .. FROM search_table WHERE FTS_query) AND ...other filters

This speeeeeedupppp beautifully! (I get sub-second queries!)

- If your query do heavy calculations and your query planner show it, consider move them into a trigger and write the solved result into a table, and query that table instead. I need to loans calculations that requiere sub-second answers and this is how I solve it.

And for your query planner investigation, this handy tool is great:

https://tatiyants.com/pev/#/plans

HarryTeunissen · on Aug 29, 2021

Questions first:

1. What are the CRUD patterns to the "blobby data". 2. What are the read patterns and how much data needs to be read.

Until Read/Write are properly understood the following solutions should be considered as general guide lines only.

If staying in PG: JSON can be indexed in Postgres You could also support a hybrid JSON/Relational model giving the best of worlds.

Read:

Create views into the JSON schema that model your READ access patterns and expose them as IMMUTABLE Relational entities. (Clearly they should be as light weight as possible)

Modify:

You can split the JSON blobs into their own skinny tables. This should keep your current semantics and facilitate faster targeted updating.

Big blobby resources such as video/audio should be managed as resources and not junk up your DB

Warning:

Abstracting the model into multiple tables may cause its own issues depending on how you ORM map your entities.

Outside the Box Thinking:

-Extract and transform the data for optimized reading. -Move to MongoDB or a Key Value store

Conclusion:

What are the update patterns? -is only 1 field being updated -inter-dependencies of the data being updated How are "update anomalies" minimized

You will need to create a migration strategy to a more optimal solution and would do well to start abstracting with views. As the data model is improved this will be a continuous process and the data model can be "optimized" without disturbing the core infrastructure requiring rewrites.

Rodeoclash · on Aug 29, 2021

I had this issue at a previous job where we would query an API (AWS actually) and store the entire response payload. As we started out we would query into the JSONB fields using the JSON operators, however at some point we started to run into performance issues and ended up "lifting" the data we cared about to columns on the same record that stored the JSON.

smackeyacky · on Aug 29, 2021

Bit hard to tell without some idea of the structure of the data, but my experience has been storing blobs in the database is only a good idea if those objects are completely self contained i.e. entire files.

If you write a small program to check the integrity of your blobs i.e. that the structure of the json didn't change over time, you may be able to infer a relational table schema that isolates those bits that really need to be blobs. Too leave it too long invites long term compatibility issues if somebody changes the structure of your json objects.

Smaug123 · on Aug 29, 2021

I think your heart shouldn't quail at the thought of re-schemaing 5-6GB! I'm going to claim that the actual migration will be very quick.

k1rcher · on Aug 29, 2021

This is an affirmation I’ve been longing to here, lol!

I’ve already done the legwork, cloning to the current prod DB locally and playing around with migrations, but the fear of applying anything potentially-production breaking is scary to a dev who has never had to work on a “critical” production system!

vendiddy · on Aug 29, 2021

I would recommend setting up a staging app with a copy of the production database, testing a migration script there, then running the same script on production once you're confident.

AdrianB1 · on Aug 29, 2021

Large blobs are not the use case of relational databases - this is the starting point for any such discussion. I have 2 current projects where I am convincing the app builders (external companies, industry-wide used apps) to change this, keep relational data in the database and take out the blobs, so far is going better than expected.

TekMol · on Aug 29, 2021

I don't know about PG, but with MariaDB, a nice way to find bottlenecks is to run SHOW FULL PROCESSLIST in a loop and log the output. So you see which queries are actually taking up the most time on the production server.

If you post those queries here, we can probably give tips on how to improve the situation.

k1rcher · on Aug 29, 2021

Interesting. I believe I noted a similar function in the Postgres docs I was scouring through Friday. I’ll give it a look and see what I can find.

Tangentially related for those who have experience, I am using Django-silk for latency profiling.

taesu · on Aug 29, 2021

also, never trust orms. they make it easier to query but they do not use/output the most optimized quries

malux85 · on Aug 29, 2021

Examine the slow queries with the query planner, don’t spend a bunch of time re-architecting on a hunch until you know for sure why it’s slow!

An hour with the query planner can save you days or weeks or wasted work!

yosito · on Aug 29, 2021

This may already be solved, but one of the last pieces remaining in my quest to be Google-free is an interoperable way to sync map bookmarks (and routes, etc) between different open source mapping apps. I can manually import/export kmz files from Organic Maps and OsmAnd, and store them in a directory synced between different devices with Nextcloud, but there's no automatic way to keep them updated in the apps, and so far I haven't found a great desktop app for managing them either. The holy grail would be to also have them sync in the background to my Garmin Fenix, but I am not aware of a way to sync POIs to a Garmin watch in the background.

Related: I'd love to have an Android app with a shortcut that allows me to quickly translate Google Maps links into coordinates, OSM links or other map links. There is a browser extension that does this on desktop, so if anyone is looking for a low hanging fruit idea for an Android app, this might be a fun idea (if I don't get around to it first).

tome · on Aug 29, 2021

Have you documented your replacements for various Google technologies somewhere? I'm particularly interested in a good calendar.

yosito · on Aug 29, 2021

I'm using Nextcloud to host my calendar. On my work Mac, I connect to it using Fantastical. On my personal Ubuntu machine I use GNOME Calendar, and on Android I use https://github.com/Etar-Group/Etar-Calendar

Everything is seamless for me, though admittedly I'm not a super heavy calendar user.

I plan to do a write up on my whole Google-free setup, but I haven't done it yet, unfortunately.

tome · on Aug 29, 2021

Thanks! Do you self-host Nextcloud or did you get an account at one of the providers?

yosito · on Aug 29, 2021

I got a VPS and installed Nextcloud with Docker. I would self-host on my own server, but I'm too nomadic for that at the moment. I think the /e/ foundation has a decent managed Nextcloud setup.

stavros · on Aug 29, 2021

I built a community that aims to keep FOSS projects alive. It's meant to solve the kitchen and egg problem by having as many people and projects sign up, and then any developer who was interested could just automatically get commit permissions to any project.

It's called Code Shelter:

https://www.codeshelter.co/

It's stalled for a while, so I don't know how viable it is, but I'd appreciate any help.

smallerfish · on Aug 29, 2021

One thing you could try to solve is coordinated revival of abandoned projects - i.e. extending your model to support unsolicited takeover of projects, in the case of a maintainer's having walked away.

For example, I use a javascript library that's best in class for what it does, and yet hasn't had any real commits from its maintainer since 2016. There are 50 pull requests open, some of which fix significant bugs, or add good new features. There are literally 2000 forks of the library, some of which are published on npm but are themselves unmaintained, and almost none of which link back to the actual fork's code from npm. It's a mess, and I bet it's a situation repeated hundreds of times over.

If you were to figure out a workflow by which a maintenance team could form on your platform, and then a) the existing maintainer is pinged to request that they add the team, falling back to b) making it easy for the new team to fork and adopt existing pull requests while supporting them through initial team-forming by laying out a workflow for assigning needed-roles, then I think you'd have a valuable platform.

The key thing is ensuring there's a large enough team to start, so that yet another fork doesn't die on the vine, so maybe think about a(n old) reddit link type interface where people can link to, vote on, and volunteer for projects, with no work needed until there's critical mass and the platform moves the project forward.

stavros · on Aug 29, 2021

Hmm, that's an interesting idea, thanks. Given that finding one maintainer is already hard, though, I think finding a team would be almost impossible...

smallerfish · on Aug 29, 2021

With the voting mechanism, you wouldn't necessarily need to form a team all at once. Maybe it takes 6 months for enough people to click the "I'd participate" button on a popular project. Granted, half of them might drop out when the project graduates...but if you can try to stake the ground of "the place to suggest and coordinate forks" then at least people who were interested might find it over time.

stavros · on Aug 29, 2021

Oh hmm, I see how you mean, that's interesting... I'll think about that, thanks!

emrah · on Aug 29, 2021

> Given the high level of trust users and project owners are putting in us, we need our maintainers to already have demonstrated their trustworthiness in the community. As such, we'd like to see any popular project you are an owner/maintainer of, as it would make it easier for us to accept you.

I certainly understand the rationale but doesn't this narrow down the universe of possible maintainers while putting even more load on existing maintainers by expecting them to take on more work?

dehrmann · on Aug 29, 2021

> kitchen and egg problem

Hadn't heard that malapropism before.

stavros · on Aug 29, 2021

Oof, must have been hungry when I wrote that.

palmtree3000 · on Aug 29, 2021

Json diffing.

I haven't found any implementations I'd consider good. The problem as I see it is that there are tree based algorithms like https://webspace.science.uu.nl/~swier004/publications/2019-i... and array based algorithms like, well, text diffing, but nothing good that does both. The tree based approach struggles because there's no obvious way to choose the "good" tree encoding of an array.

I've currently settled on flattening into an array, containing things like String(...) or ArrayStart, and using an array based diffing algorithm on those, but it seems like one could do better.

te · on Aug 30, 2021

At the risk of not being helpful: I have some json files that are updated weekly that I keep under source control in git. The week-to-week updates are often fairly simple, but git was showing some crazy diffs that I knew were way more complicated than the update. I soon realized that the data provider was not consistently sorting the json arrays; when I began sorting the json arrays by rowid everytime before writing it, the diff's were as straightforward as expected. I think I don't understand the problem you're encountering, because this solution seems too obvious.

loa_in_ · on Aug 30, 2021

The problem is in the syntax of JSON itself. Use JSONL instead

dzjin · on Aug 29, 2021

I want to improve parts of online professional networking, specifically to be more about self-mentoring/shared learning, as opposed to sales connections.

This is ever more important with the onset of remote hiring, remote work, and the isolation/depersonalization it brings to newcomers to the industry.

There's also an "evil" momentum in remote hiring -- some companies _need_ asynchronous interviews to support their scaling and operations, and the general perception is that it's impersonal and dehumanizing.

This made me think that if we preemptively answered interview questions publicly, then it'd empower the job seekers to have a better profile/fight back a dehumanizing step, while allowing non-job-seekers to share the lessons that were important to them.

I've been getting decent feedback on my attempt at the solution HumblePage https://humble.page, the reality is that there's a mental hurdle to putting your honest thoughts out there.

sundarurfriend · on Aug 29, 2021

This is a nice idea, to talk about and get thinking about these "soft" questions that people often struggle with.

One feedback about the homepage: show a few examples of how people have answered questions, below the prompt. That's more helpful to get us thinking about our own answers, compared to a blank field. (Also, it's not clear what the percentages are meant to represent there. And I'm guessing the number next to the edit icon shows how many people have answered the question already? May need some UI tweaks on these.)

dzjin · on Aug 29, 2021

Your guess is close! It is the total numbers, and the green/blue ratio indicate how many people answered the prompt publicly vs privately.

My intention was to show the general comfort level of answering the prompt in public. Looking back, I wonder if I was being too quirky.

I’m thinking the same on needing UI tweaks, I’m planning for major rearrangements.

Thank you for your interest. Please feel free to reach out via the contacts if you’d like an invite.

kroltan · on Aug 29, 2021

Economically sustainable and ethical monetization of user-generated-content games.

The closest most known example of this kind of game nowadays is Roblox, but I'm thinking of things more like Mario Maker or the older-generation Atmosphir/GameGlobe-likes.

Unlike "modding platforms" or simulators/sandboxes/platforms such as Flight Simulator, VRChat or ARMA, these games' content are created by regular players with no technical skill, which means the game needs to provide both the raw assets from which to build the content, as well as the tool to build that content.

Previous titles tried the premium model (Mario Maker), user-created microtransactions (Roblox) and plain old freemium (Atmosphir and GameGlobe).

I suspect Mario Maker only works because of the immense weight and presence of the franchise.

Roblox's user-created microtransactions (in addition to first-party ones) seem to be working, but they generate strange incentives for creators, which I personally feel taints a lot of the games within it. (The user-generated content basically tends to become the equivalent of shovelware)

GameGlobe failed miserably by applying the microtransaction model to creator assets, which means that to make cool content, creators have to pay as well as spend lots of their time actually building the thing, which means most levels actually published end up being the same default bunch of free assets and limited mechanics.

Atmosphir is a bit closer to me so I find some more nuance in its demise, but long story short, essentially they restricted microtransactions only to player customization, however it didn't seem to be enough to cover the cost of developing the whole game/toolset. Eventually adding microtransactions to unlock some player mechanics, which meant that some levels were not functional without owning a specific item.

---

In short, the only thing that can effectively monetize on is the game itself (premium model) or purely-cosmetic content for players. Therefore, to incentivize the cosmetics, the game needs to be primarily multiplayer, which implies lots more investment on the creator tooling UX, as well as the infrastructure itself. But this also restricts the possibilities for the base game somewhat.

f0e4c2f7 · on Aug 29, 2021

My favorite Microtransaction systems are in Counter-Strike Global Offensive and Planetside 2.

Planetside 2 has very slight pay to win mechanics in the form of subscriptions for more xp, but it doesn't feel bad to play without pay.

Counter-strike on the other hand (and I think so other valve games too) have just about the perfect model in my mind. There are no advantages you get by paying, only status. The skins that you can buy and sell look cool but are purely cosmetic.

Even with this in mind people spend quite a lot of money (we're talking hundreds of dollars for one gun skin in some cases). It always seemed like a great way to generate revenue ethically.

One thing I will note with that model is they still have the gambling mechanics with the "crates" that open random skins. You could probably crank it up one more ethical notch by getting rid of those or trying to make them less addictive.

kroltan · on Aug 30, 2021

That is indeed my current conclusion of the least-ethically-bad viable monetization. Purely cosmetic.

However, paid status is not really a big hook for single-player games, you need to be able to show it off! This means that the game must be designed primarily around multiplayer interaction, which is fine but limits a lot the kinds of games you can implement with this monetization.

Kinrany · on Aug 29, 2021

A different but similar in topic problem is running and playing tabletop roleplaying games like Dungeons and Dragons.

The solution is a general-purpose distributed computing platform designed for end-user development.

The closest three things that exist are Google Sheets, replit.com and dndbeyond.com. Replit is too low level, dndbeyond is not powerful enough, sheets are stuck with grid and too clunky for everything else.

Here's a few things the user should be able to do:

1. Design a tabletop roleplaying game system from scratch and automate all the math

2. Write content designed to be used with a system

3. Use systems and content designed by other people, without copy-pasting

4. Modify the system and the content designed by other people for your own purposes

5. Share access to the content in a granular way

Tabletop roleplaying games are unique: they thrive on content that must be created and shared quickly, but includes simple but fully general programming capabilities. Seems like a great place to start making programming as commonplace as literacy.

jpindar · on Aug 29, 2021

I'm surprised you didn't mention Core.

https://www.coregames.com/

kroltan · on Aug 30, 2021

Yeah the post was getting a bit long, there are a bunch of similar games to the ones listed, I just used one of each to exemplify each strategy.

Core is very similar to Roblox in that the creation tools are rather involved, it tends more to a platform with distinct creator/consumer roles.

There's also the PS4 game Dreams, as well as other integrated-modding initiatives like in Krunker.

abricot · on Aug 29, 2021

It's a Roblox copy, and as such is already mentioned.

chubot · on Aug 29, 2021

These are statistics/math problems that 2 medical professionals I'm seeing are working on, not my own work. But they got me curious. FWIW I worked in "data science" as a software engineer for many years, and did some machine learning, so I have some adjacent experience, but I could use help conceptualizing the problems.

Does anyone know of any books or surveys about statistics and medicine, or specifically mechanics of the human body?

- One therapist is taking measurements of say your arm motion and making inferences about the motion of other muscles. He does is very intuitively but wants to encode the knowledge in software.

- The other one has an oral appliance that has a few parameters that need to be adjusted for different geometries of the mouth and airway.

The problems aren't that well posed which is why I'm looking for pointers to related materials, rather than specific solutions (although any ideas are welcome). I appreciate replies here, and my e-mail is in my profile. I asked a former colleague with a Ph.D. and biostats and he didn't really know. Although I guess biostats is often specifically related to genetics? Or epidemiology?

I guess the other thing this makes me think of is software for Invisalign or Lasik, etc. In fact I remember a former co-worker 10 years ago had actually worked on Lasik control software. If anyone has pointers to knowledge in these areas I'm interested.

jtolmar · on Aug 29, 2021

> One therapist is taking measurements of say your arm motion and making inferences about the motion of other muscles.

This seems like a sequential Bayesian filtering problem. Probably high enough dimension that you should just use a particle filter. The big seminal background text in this area is Bishop: Pattern Recognition and Machine Learning.

If the "motion of other muscles" is inferring pose, you could also look into what computer graphics calls inverse kinematics (a typical IK model has a number of dimensions that could fit into a particle filter). There's some more in-depth stuff in motion planning that actually takes into account muscle capability. But I wouldn't know where to find info on that, short of watching the last several years of Siggraph Technical Papers Trailers, grabbing all the motion planning ones, then reading everything they cite.

chubot · on Sept 1, 2021

Thanks, I will follow these references.

I've heard of inverse kinematics but I think it's more focused on "modeling" than statistics/probability? That is, you would have to model each muscle?

I think he is doing something that is more "invariant" across human variation? (strength, body dimensions, age, etc.) I'm not sure which is why my question was vague, but this is helpful.

jtolmar · on Sept 1, 2021

Yeah, IK is about pose and motion modeling. But you can put any state+motion model inside sequential Bayes, and get the probability that the model is in a particular configuration out.

Hard to know whether that's relevant without knowing what he's trying to predict though.

jpeloquin · on Aug 29, 2021

My research specialty is in orthopedic biomechanics. For the arm motion thing, it sounds like you might want inverse kinematics or inverse dynamics. Take a look at OpenSim: https://simtk.org/projects/opensim

For the oral appliance adjustment, I'm not sure what your output measures of interest are. If they're mechanical maybe you want to do a sensitivity analysis using FEA. Maybe look at FEBio: https://febio.org/

As for books or surveys, biomechanics is huge topic so I'm not sure what to recommend without wasting your time. If you're still defining the problem, maybe run some searches on Pubmed with the "review" and "free full text" boxes checked, and browse the results until you find which sub-sub-topic is relevant to you?

https://pubmed.ncbi.nlm.nih.gov/?term=biomechanics&filter=si...

If no one on the team knows statics, dynamics, and (if you're considering internal strain and stress) continuum mechanics, consider finding a mechanical engineer to help.

chubot · on Sept 1, 2021

Thank you for the references, I will follow these!

I think the basic idea is that when you're doing physical therapy that targets certain muscles, you have to find the muscle(s) that are limiting the motion! This is not obvious because they all interact.

Like if you have a back problem, you can try to exercise your back all you want, and that may not actually fix the problem. Because the real issue could be with your leg, which causes 16 hours a day of "bad" motion against your pelvis, which in turn messes up your back.

All the muscles in the body are interlinked and they often compensate for each other. When people have a problem in one area, they compensate in other ways.

So I have the same question as above: I think inverse kinematics is more about "modeling"? You would need to model every muscle, which is hard, and it is specific to a person?

I think his intuition is partly based on a mental model, but it's also probablistic. I think the model has to capture the things that are "invariant" across humans (i.e. basic knowledge of anatomy), and the variation between humans is the probablistic part. It's also based on variation in your personal health history / observed behavior, e.g. how you walk, how often you're sitting at a computer, etc.

So it does feel like an "inference" problem in that sense -- many factors/observations that result in multiple weighted guesses of the cause / effective therapies.

jpeloquin · on Sept 1, 2021

Inverse kinematics is about reconstructing body motion from position marker data, not really about modeling. For example, glue some tennis balls to a person's arms and legs, track their position from video of the person walking around, and use inverse kinematics to reconstruct their joint angles (their skeletal pose) across time. It's also possible to do this with marker-free methods.

Inverse dynamics takes the kinematics data from above and, in combination with ground reaction forces measured from a force plate (or instrumented footwear, etc.), calculates the forces and moments on each joint. Since control of the human musculoskeletal system is over-determined (the same motion, forces, and moments can be produced by multiple muscle activation patterns), EEG data or even ultrasound elastography is sometimes used to better constrain estimates of muscle activation patterns.

In your example the usual approach would be to use (elements of) the above methods to find out if a patient had unusual motion patterns, like the suspected abnormal leg motion in your back pain patient. Statistics comes into play once you have population data to classify as "good" or "bad", and when you're trying to determine if the hypothesized relationships between symptoms and particular motion / muscle activation patterns genuinely exist. Of course, it's fine to try different approaches (but don't forget to obtain IRB review and comply with the various regulations on human subjects research).

ad404b8a372f2b9 · on Aug 29, 2021

I can't help you conceptualize these specific problems but having worked on similar problems in the past I'd advise you to look into ordinary differential equations applied to those systems. They're used a lot for modelling in medical science and even if you're not interested in the dynamics of it they might lead you to the relevant literature for your problems and will address the parameters you're interested in and how they relate to each others.

andrewnc · on Aug 29, 2021

It sounds vaguely related to "system identification"?

udev · on Aug 29, 2021

I am blocked on finding a good (defined below) way to determine whether a product description A and product description B refer to the same product.

Imagine that a product description is a n-dimensional vector like:

  ( manufacturerName, modelName, width, height, length, color, ...)

Now imagine you have a file with m such vectors (where m is in millions), and that not all fields in the vectors are reliable info (typos, missing info, plain wrong, etc).

What is a good way to determine which product descriptions refer to the same product.

Is this even a good approach? What is state of the art? Are there simpler ways?

Here is what I mean by good:

  - robust to typos, missing info, wrong info
  - efficient since both m and n are large
  - updateable (e.g. if classification was done, and 10k new descriptins are added, how to efficiently update and avoid full recomputation)

gurgeous · on Aug 29, 2021

I have worked on this problem many times, at many companies. I am working on it again, actually. Usually some combination of scoring and persisting results in CSVs for human review.

(edit: I am at a desktop now and I can say a bit more)

Here is the process in a nutshell:

1. Create a fast hashing algorithm to find rows that might be dups. It needs to be fast because you have lots of rows. This is where SimHash, MinHash, etc. come into play. I've had good luck using simhash(name) and persisting it. Unfortunately you need to measure the hamming distance between simhashes to calculate a similarity score. This can be slow depending on your approach.

2. Create a slower scoring algorithm that measures the similarity between two rows. Think about a weighted average of diffs, where you pick the weights based on your intuition about the fields. In your case you have handy discrete fields, so this won't be too hard. The hardest field is name. Start with something simple and improve it over time. Blank fields can be scored as 0.5, meaning "unknown". Hashing photos can help here too.

3. Use (1) to find things that might be dups, then score them with (2). Dump your potential dups to a CSV for human review. As another poster indicated, I've found human review to be essential. It's easy for a human to see that "Super Mario 2" and "Super Mario 3" are very different.

4. Parse your CSV to resolve the dups as you see fit.

Have fun!

dr_zoidberg · on Aug 31, 2021

With regards to 1, I wonder: why would calculating the Hamming distance be slow? In python you can easily do it like this:

    hamming_dist = bin(a^b).count("1")

It relies on a string operations, but takes ~1 microsecond on an old i5 7200u to compare 32bit numbers. In python 3.10 we'll get int.bit_count() to get the same result without having to do these kind of things (and a ~6x speedup on the operation, but I suspect the XOR and integer handling of python might already be a large part of the running time for this calculation).

If you need to go faster, you can basically pull hamming distance with just two assembly instructions: XOR and POPCNT. I haven't gone so low level for a long time, but you should be able to get into the nanosecond speed range using those.

Radim · on Aug 29, 2021

What's your cost matrix? How much does a false positive hurt? False negative?

I built a commercial system like that for Thermo Fisher, except their descriptions were encoded as natural language text on input, not vectors (for an extra complication).

Some observations:

1. Crude methods based on vector embeddings, cosine similarity, Levenshtein, etc – don't work, if you care at all about false positives.

I see sibling comments recommend this, but it's clear this cannot work if you think about it. Values like "black" and "white", or "I" and "II" (part numbers), "with" and "without", are typically close together in such crude representations, but may lead to products that are not interchangeable.

2. A hybrid approach worked. The SW produced suggestions for which products might be duplicates (along with a soft confidence score), then let a human domain expert accept / reject these suggestions. It also learned from these expert decisions as it went, to save human time.

What I quickly learned is that even as a human (programmer with a PhD in ML), I could not look at two product descriptions and make the decision myself. Are these the same product or not? One word, even one letter, could be absolutely vital. Or absolutely irrelevant. Sometimes even the same attribute / word, depending on the product category.

Hence the final interactive solution with a domain expert in the middle. It worked well and saved time, rather clever, but not in the "hooray NN training" way. A lot of work went into normalizing the surface features intelligently based on context: units, hyphens / tokenization, typos…, because that's a mess in product sheets. The "fancy" downstream ML and clustering part was relatively simple by comparison.

But YMMV, the Thermo Fisher products were fairly specialized and sophisticated (in their millions).

wizzwizz4 · on Aug 29, 2021

Usually, I do this sort of thing somewhat manually, building up an algorithm (mostly classical, with a little ML as a treat) that can deal with the problem.

I'd start by detecting common typos. Typos are similar to un-typo'd data, so I'd do a frequency analysis on the textual representations of manufacturer name and model name, and a Levenshtein distance calculation, then synonymise the obvious synonyms (looking things up when I wasn't sure). The key idea is that you have access to more information than just this dataset: Tony and Tomy are different manufacturers, but Sony and Somy aren't (even though somy is in the dictionary and tomy isn't).

Once the manufacturer and model fields are mostly typo-free (after typo replacement – don't modify the original file, if you can help it!), you can start looking at dimensions and colour. Sort by manufacturer, and start de-duping entries. Once you get a feel for the process you're doing (e.g. under what circumstances do you check whether there's a 102mm Phillips screwthread?), you can start automating bits of it. There will always be special-cases, but your job is to get the data processed correctly, not to get the computer to process the data.

Accidentally aliasing two different products is much worse than leaving the same product described twice, so err on the side of “these are different”. (Keep in mind that manufacturers of some things, e.g. SD cards, often pretend two different products are the same – so you can't always win!) Remember, humans exist: bothering them a few million times is a problem, but bothering them a few hundred would be okay.

When new data comes in, I'd run all the code I used to come up with my system, and see if the output was notably different. If it was, I'd get the computer to let me know.

I'd also add some way for users to flag duplicates. Many humans make light work.

alex989898 · on Aug 29, 2021

You could generate word embeddings for all natural language text fields and then do cosine similarity?

ethn · on Aug 29, 2021

Use a Minhash-LSH ensemble with pre-processing on the words to fix typos via Levenshtein. Tune parameters to get the best distance

Fragoel2 · on Aug 29, 2021

Definitely some clustering method based on similarity of the vectors (there are many, pick a simple one to start)

alecmgo · on Aug 29, 2021

I want to make technical recruiters better at their job.

Many sourcers and recruiters don't have a technical background and find it very difficult to hire software engineers, especially in the current labor market which is very tight.

I'm starting off simple: writing recruiting guides from a software engineer's perspective that are easy to understand.

Are there other ways we can make technical recruiters better?

whoisjohnkid · on Aug 29, 2021

> Are there other ways we can make technical recruiters better?

- list salary range for positions

- emphasize tech stack

- emphasize number of rounds

I’ve wasted my time in the past going through the interview process only to find out the company’s budget for the position was only up to X after getting the offer; a valuable lesson I’ve learned to avoid since then of course.

I also see some recruiters only talking about what the business does … leaving out the tech stack.

If these points are clear and easily visible to recruiting leads they might get higher quality candidates.

Just my two cents. ¯\_(ツ)_/¯

edmundsauto · on Aug 29, 2021

I’ve also been thinking about this problem space. My approach is to help candidates build skills, demonstrate proficiency, in a loop with the recruiters.

Basically, take the whole “how I learned my data science skills” into something that can be done in public.

The recruiters can then see a wide range of examples, and can be better at picking up where people’s strengths and talents are.

(This is focused on analytics)

contingencies · on Aug 29, 2021

Frustrated by the degree of manual programming process in production metal machining. The industry exists largely on inertia. I would like to resolve this by applying standard optimization algorithms to a set of known machining strategies plus machine, work-holding, material, part and tool inputs. Have already analyzed the problem space to some extent and will be touring a huge production facility next week to better understand best-in-class processes from large established players. Need someone to either wrap existing simulation algorithms (any CAM system) or write enough of one (not that hard, the solution space is extremely multivariate but well understood and well documented) to make it feasible (not too hard for 2.5D machining). You can get as intellectual as you like in the solution, but remember perfect is the enemy of done. Value is huge, happy to split equity on a new entity to resolve if a workable solution for the easier subset of parts emerges in the next few weeks.

iamgopal · on Aug 29, 2021

We run about 40-50 CNC. Lots of our engineering time goes in to planning, on how to step by step machine a component so that it can reach mentioned tolerances. Sometimes required tolerance are at or below the machine accuracy. Are you going to solve this also ?

contingencies · on Aug 29, 2021

I am looking at the low hanging fruit, 80/20 right now.

pgroves · on Aug 29, 2021

How to make png encoding much faster? I'm working with large medical images and after a bit of work we can do all the needed processing in under a second (numpy/scipy methods). But then the encoding to png is taking 9-15secs. As a result we have to pre-render all possible configurations and put them on S3 b/c we can't do the processing on demand in a web request.

Is there a way to use multiple threads or GPU to encode pngs? I haven't been able to find anything. The images are 3500x3500px and compress from roughly 50mb to 15mb with maximum compression (so don't say to use lower compression).

gred · on Aug 29, 2021

I've spent some time on this problem -- classic space vs. time tradeoff. Usually if you're spending a lot of time on PNG encoding, you're spending it compressing the image content. PNG compression uses the DEFLATE format, and many software stacks leverage zlib here. It sounds like you're not simply looking to adjust the compression level (space vs. time balance), so we'll skip that.

Now zlib specifically is focused on correctness and stability, to the point of ignoring some fairly obvious opportunities to improve performance. This has led to frustration, and this frustration has led performance-focused zlib forks. The guys at AWS published a performance-focused survey [1] of the zlib fork landscape fairly recently. If your stack uses zlib, you may be able to find a way to swap in a different (faster) fork. If your stack does not use zlib, you may at least be able to find a few ideas for next steps.

[1] https://aws.amazon.com/blogs/opensource/improving-zlib-cloud...

lightcatcher · on Aug 29, 2021

I have no experience in PNG encoding, but found https://github.com/brion/mtpng The author mentions "It takes about 1.25s to save a 7680×2160 desktop screenshot PNG on this machine; 0.75s on my faster laptop." which makes me think your slower performance on smaller images either comes using the max compression setting or using hardware with worse single threaded performance.

Although these don't directly solve the PNG encoding performance problem, maybe some of these ideas could help?

* if users will be using the app in an environment with plenty of bandwidth and you don't mind paying for server bandwidth, could you serve up PNGs with less compression? Max compression takes 15s and saves 35MB's. If the users have 50mbit internet, then it only takes 5.6s to transmit the extra 35MB, so you could come out 10s ahead by not compressing. (yes, I see your comment about "don't say to use lower compression", but no reason to be killed by compression CPU cost if the bandwidth is available).

* initially show the user a lossy image (could be a downsized png) that can be quickly generated. You could then upgrade to a full quality once you finish encoding the PNG, or if server bandwidth/CPU usage is an issue then you could only upgrade if the user clicks a "high-quality" button or something. If server CPU usage is an issue, the low then high quality approach could let you turn down the compression setting and save some CPU at the cost of bandwidth and user latency.

minhmeoke · on Aug 29, 2021

Are you required to use PNG or could you save the files in an alternative lossless format like TIFF [1]? If you're stuck with PNG, mtpng [2] mentioned earlier seems to be significantly faster with multithreading (>40% reduction in encoding times). If you're publishing for web, TIFF or cwebp might also be possibilities with -mt (multithreading) and -q 25 (lower compression and larger filesize but faster) flags, or an experimental GPU implementation [3].

[1] https://blender.stackexchange.com/questions/148231/what-imag...

[2] https://github.com/brion/mtpng

[3] https://emmaliu.info/15418-Final-Project/

Const-me · on Aug 29, 2021

GPGPU is the way to go.

Not terribly hard if you only need 1-2 formats supported, e.g. RGBA8 only. You don't need to port the complete codec, only some initial portion of the pipeline and stream data back from GPUs, the last steps with lossless compression of the stream ain't a good fit for GPUs.

If you want the code to run on a web server, after you'll debug the encoder your next problem is where to deploy. NVidia teslas are frickin expensive. If you wanna run on public clouds, I'd consider their VMs with AMD GPUs.

pgroves · on Aug 29, 2021

Thanks, I hadn't heard of that and I will look into it. This is a research setting with plenty of hardware we can request and not a huge number of users so that part doesn't worry me.

Const-me · on Aug 29, 2021

> This is a research setting with plenty of hardware we can request and not a huge number of users

If you don’t care about cost of ownership, use CUDA. It only runs on nVidia GPUs, but the API is nice. I like it better than vendor-agnostic equivalents like DirectCompute, OpenCL, or Vulkan Compute.

physicles · on Aug 30, 2021

I solved a similar problem last year. As others have said, your bottleneck is the compression scheme that PNG uses. Turning down the level of compression will help. If you can build a custom intermediate format, you'll see huge gains.

Here's what that custom format might look like.

(I'm guessing these images are gray scale, so the "raw" format is uint16 or uint32)

First, take the raw data and delta encode it. This is similar to PNG's concept of "filters" -- little processors that massage the data a bit to make it more compressible. Then, since most of the compression algorithms operate on unsigned ints, you'll need to apply zigzag encoding (this is superior to allowing integer underflow, as benchmarks will show).

Then, take a look at some of the dedicated integer compression algorithms. Examples: FastPFor (or TurboPFor), BP32, snappy, simple8b, and good ol' run length encoding. These are blazing fast compared to gzip.

In my use case, I didn't care how slow compression was, so I wrote an adaptive compressor that would try all compression profiles and select the smallest one.

Of course, benchmark everything.

bjornlouser · on Aug 29, 2021

> Is there a way to use multiple threads or GPU

Maybe you could write the png without compression, compress chunks of the image in parallel using 7z, then reconstitute and decompress on the client side.

pgroves · on Aug 29, 2021

This is on our list of possibilities. It would take a little more time than I'd like to spend on this problem but it would work.

primitivesuave · on Aug 29, 2021

I would also be interested in knowing the answer to this. Currently we use OpenSeadragon to generate a map tiling of whole slide images (~4 GB per image), then stitch together and crop tiles of a particular zoom layer to produce PNGs of the desired resolution.

yboris · on Aug 29, 2021

I'm unsure if this will help, but the new image format JPEG XL (.jxl) is coming soon to replace JPEG. It will have a lossless and a lossy abilities. It claims to be faster than JPEG.

Another neat feature is that it's designed to be progressive, so you could host a single 10mb original file, and the client can download just the first 1mb (up to the quality they are comfortable with).

Take a look: https://jpegxl.info/

pgroves · on Aug 29, 2021

This is a research university that moves very slow, so waiting two years for something better is actually a possibility (and prerendering to S3 works ok for now). I'll keep this bookmarked.

dehrmann · on Aug 29, 2021

Since this is Python, which encoder are you using? I'd make sure it's in C, not Python. You might also be spending a lot of time converting numpy arrays to Python arrays.

pella · on Aug 29, 2021

also check the FPGA cards (ask the Xilinx; Altera/Intel, ...)

cygned · on Aug 29, 2021

I try to find an agile project management tool that works for us. We run on what many would call Scrum (it’s not actually Scrum).

We are on JIRA now, and it’s … JIRA. We tried basically any other tool, including Excel (yes, that is somewhat possible).

My problem generally is that tools are slow, planning is cumbersome, visibility is limited and reporting for clients is often even more limited.

Heck, I’d even write my own tool if I knew it would help others, but I am concerned it’s too close to what we already have for anyone to actually migrate.

You could help me by sharing your thoughts!

yardshop · on Aug 29, 2021

I've recently started using ClickUp for managing my helpdesk and development work and I like it a lot. I don't do scrum myself but the product claims to be useful for that kind of work, as well as many other approaches and use cases.

https://clickup.com https://clickup.com/on-demand-demo

ClickUp for Agile Workflows https://www.youtube.com/watch?v=H9hZRwivnL8

huksley · on Aug 29, 2021

Depends on workflow and team size. For small teams good fit could be some kanban based tools, for example Trello or GitHub projects.

You could also modern try agile tools, for example Linear. JIRA is good for 100+ teams and complex architectures.

designerdonna22 · on Aug 30, 2021

We use Restyaboard for Agile marketing in that you will be able to manage all your projects, teams, and clients from one single space. https://restya.com/board/demo

MildlySerious · on Aug 29, 2021

Is linear.app in the realm of what you're looking for? Have you tried that?

Not affiliated, but I've had a positive experience with it in a small team. I would describe it as an IDE for issues.

101008 · on Aug 29, 2021

Asana works really good for us. Really good UI/UX and fast, which I think it's the best feature they have :)

wtf77 · on Aug 29, 2021

have you tried https://tara.ai/?

tomcooks · on Aug 29, 2021

Try Asana

zanek · on Aug 29, 2021

I'm working on a different type of compression (for all file types). I am able to to get in the 10-20% range, but the speed to compress is to slow many times, or the compression doesnt complete at other times (I've been working on this for years). My personal website: http://danclark.org

I'm also working on a conversational search engine (using NLP) at http://supersmart.ai

BudaDude · on Aug 29, 2021

Have you looked into Middle Out compression?

zanek · on Aug 29, 2021

Funny, I've actually had a lot of fun working on this compression software. It's a weird mix of needing it to be fast and hitting a compression threshold of it being useful. One of the best projects I've embarked on

pedrokost · on Aug 29, 2021

We are experiencing very high CPU load caused by tinc [0], which we use to ensure all communication between cloud VMs is encrypted. This is primarily affecting the highest traffic VMs, including the one hosting the master DB.

I am starting to consider alternative tools such us wireguard to reduce load, but I am concerned of adding too much complexity. Tinc's mesh network makes setup and maintenance easy. The wireguard ecosystem seems to be growing very quickly, and it's possible to find tools that aim to simplify its deployment, but it's hard to see which of these tools are here to stay, and which will be replaced in a few months.

What is the best practice, in 2021, to ensure all communication between cloud VMs (even in a private network) is encrypted?

[0]https://www.tinc-vpn.org/

PhilippGille · on Aug 29, 2021

Apart from some smaller projects building on top of WireGuard, there's Tailscale [1]. One of the founders is Brad Fitzpatrick who worked on the Go team at Google before and built memcached and perkeep in the past.

Outside of the WireGuard ecosystem there's ZeroTier [2] which has been around for a while and they're working on a new version; and Nebula [3] from Slack, which is likely to be maintained as long as Slack uses it.

There might be others, but with tinc these four are the ones I've seen referred to most often.

[1] https://tailscale.com

[2] https://www.zerotier.com

[3] https://github.com/slackhq/nebula

iyn · on Aug 29, 2021

+1 for Tailscale, the product is great. I've used it in a very limited scale but can vouch for quality and performance. No CPU issues at all (even on rPi).

qchris · on Aug 29, 2021

Similar to Tailscale is the Innernet project, which has similar goals but is fully open source (also built on Wireguard). I've heard that set-up is a bit more painful, but for those who are interested in FOSS or self-hosting, it might be worth looking into.

[1] https://github.com/tonarino/innernet

ignoramous · on Aug 29, 2021

NoCode: fly.io with its 6pn (out-of-the-box private networking among clusters in the same org).

DIY: envoyproxy.io / HashiCorp Consul for app-space private networking over public interfaces.

LowCode: Mesh P2P VPN network among your clusters with FOSS/SaaS like WireTrustee / tailscale.io / Slack Nebula.

udev · on Aug 29, 2021

What kind of loads are we talking about here? How many requests per seconds? Or is each request response large?

Have you noticed whether it is worse for lots of small requests vs large data transfers?

I use a very similar setup, but haven't seen tinc CPU usage matter yet, though for very low traffic.

ID1452319 · on Aug 31, 2021

There is a juxtaposition in the UK job market. We have millions of people working in low-paid precarious jobs in retail, food service, warehousing etc. while simultaneously companies complain that they cannot recruit into highly-paid, skilled roles due to a lack of candidates.

Given that you can study Introduction to Computer Science from Harvard University, online, for free and in your own time, it seems like the barriers to building skills is lower than ever.

However, many people are put off or intimidated by the idea of studying such a course. My solution to this is some kind of mentoring, either 1-to-1 or more likely in small groups. However, this is very resource intensive for my idea to scale. I'd be very interested to hear how others might approach this, both the mentoring or the underlying encouragement to study.

dbancajas · on Aug 29, 2021

How to find motivation/energy to do a long-term creative project when having a full time jobs + other responsibilities?

bambax · on Aug 29, 2021

The brain hates to start things and loves to finish things. This can be hacked in that a working session should always leave something unfinished.

Say you're writing a novel. Every writing session (but the first, obviously) should cover the end of the last scene and the beginning of a new one, AND THEN STOP, i.e. not finish the new scene.

Your brain will want to come back to the work to finish it, which overcomes the friction of "starting" something new every time.

It's easier said than done. It's surprisingly difficult to leave something unfinished at the end of each work session. But that's the trick.

xyzzy_plugh · on Aug 29, 2021

The brain you describe is not the brain I possess.

Starting is easy.

bambax · on Aug 29, 2021

"starting" is ambiguous. I really meant "getting to work".

Thinking about new things and maybe throwing down a few ideas is indeed easy and pleasant.

But deciding to spend a few hours to move a project (instead of not) is what the brain hates. It hates commitment, and is very afraid of the opportunity cost.

jjk166 · on Aug 29, 2021

Nah, I'm with parent commenter on this. When I'm excited about something I have no problem diving into it for hours on end. But when I know that something is 90% done and it just needs to be tidied up, I will do anything other than working on it. Either everything from solving the hardest problem to being completely done happens in one sitting, or it never gets finished.

MaurizioPz · on Sept 1, 2021

I've adopted this sort of trick ending my programming session with a failing unit test. I works quite well (When I remember to do it)

tomcooks · on Aug 29, 2021

Wake up at 0430, exercise, take a shower and then work till you need to get ready for your full time job. If possible also dedicate half of your lunch hour to your project as well, together with half a weekend day.

It's quiet so early in the morning, so your productivity will skyrocket. I've coded my Paras this way while working a full time, heavy blue-collar job.

sidcool · on Aug 29, 2021

Does it not cause adverse impact on the day job?

tomcooks · on Aug 29, 2021

Quite the opposite in my experience

slim · on Aug 29, 2021

How close are you to solving that ? And please would you share your progress :)

slothtrop · on Aug 29, 2021

My current strat is to channel my scant motivation into maximizing my sleep and well-being, expecting to squeeze out more motivation from that.

michaelbuckbee · on Aug 29, 2021

I could use some help with some heuristics for Machine Learning, like how much data do I need to make a workable model, what framework/approach makes more sense given my ultimate goals.

Here's an example: there's a lot of ML tutorials on doing image identification. Like you have a series of images: picture one might have an apple and a pear in it, picture 2 might have an apple, orange, and a banana in it.

Where I'm struggling is putting this into my domain. I have a 100k images and from that around 1k distinct labels (individual images can have between 1 to 7 of these different labels), with between 13,000 to 100 images as examples in each label.

Is that enough data? Should I just start working on gathering more? Is this a bad fit for a ML solution?

thegginthesky · on Aug 29, 2021

Hey, I'm a ML practitioner for over 6 years and I'm glad to help.

1k distinct labels with a long tail distributions which is what you are describing, is definitely a challenging problem. It's called a imbalanced classification problem.

I'd focus first test how well your model will be able to predict these classes doing a stratified cross validation (stratification controlled by the label class), and measure the F-score, Weighted Accuracy and ROC AUC. Check also the precision and recall for each class. You'll definitely see that the model predicts better for the labels with more samples. The code that you use here, you'll be able to reuse later on, so keep it organized and easy to follow.

Then you have a couple options, focus on gathering more examples for the labels with small sample size, or try to oversample your dataset. This article is a good place to start https://towardsdatascience.com/4-ways-to-improve-class-imbal...

Considering this problem of image classification is normally solved with deep learning, the more data you have, the better will be your results.

michaelbuckbee · on Aug 29, 2021

This was very helpful, thank you so much!

ad404b8a372f2b9 · on Aug 29, 2021

Computer Vision is the one domain where ML (and neural networks in particular) is the undisputed king. Unless you're in an embedded application where you can't use neural networks in which case you might want to go with handcrafted features to trade-off accuracy for speed/compute-efficiency.

With regards to the size of your dataset, there's no hard rules it depends on the complexity of the task. Problems with a high number of classes are among the more difficult ones, 100 samples per class might be enough or it might not . The only way to know for sure it to try and see if you reach a performance that's acceptable for your application.

I recommend the Pytorch framework, it's coherent, easy to use and well documented (both the API reference and the examples available on SO and throughout the net). Your problem is similar to imagenet (assuming you want to detect the presence of a class and not its position, in which case it's a different problem.) so you can try to run one of the pytorch tutorials and see how well it does. The only difference is you want to detect multiple classes in one single image so you'll have to adjust your output layer and loss potentially but the network itself could remain the same. You also might want to look into doing transfer learning with a pretrained imagenet network to speed-up the training.

rolisz · on Aug 29, 2021

Define workable model. Do you care more about recall (how many of the images with label X will be labeled by your model with label X) or precision (how many of the images where the model says have label X are actually with label X)?

It is a good fit for ML, but you need to be clear on what the results will be. If you expect 100% accuracy, that won't happen. Even 90%+ accuracy would be require a lot of effort.

mhuffman · on Aug 29, 2021

If you have not at least skimmed through fast.ai, you should definitely do that, as the course itself addresses some of this and the people on their forums are among the most helpful I have ever seen!

Second, this could be more than enough! Especially if you are doing transfer learning.

Third, you can "inflate" the amount of images you have now with "Image Data Augmentation"

Y_Y · on Aug 29, 2021

The answer is, "it depends". My recommendation would be to just start trying. What you describe sounds like you could train it in a couple of days. Grab an off-the-shelf resnet and just see what happens. This is a well-studied problem, you can just look at papers that train on imagenet, and then tweak their approaches.