Ask HN: What problem are you close to solving and how can we help?
263 points by zachrip on Aug 29, 2021 | 472 comments
Please don't list things that just need more bodies - specifically looking for intellectual blockers that can be answered in this thread.



I want to bring back old school distributed forum communities but modernise them in a way that respects attention and isn’t a notification factory.

Mastodon is a pretty inspirational project, but the Twitter influence shows; I miss the long-form writing that was encouraged before our attention spans were eroded.

Not at all close to solving it, but it’s been on my mind for a long time. Would love to hear if there are others like me out there and what you imagine such a community to look like.


Is this a software problem? There are a lot of open source platforms (each with their differences), from old-school forums to more modern ones. I think the problem is that people don't want to use a forum, or don't know how.

This is based on my experience:

- Old people: they started using the internet recently, so they are used to social networks (Facebook, Instagram) and newspaper websites.

- Young people: hard to get them to use the browser; if there isn't an app (Instagram, TikTok), you are lost. If they want to discuss a topic, it's mostly Twitter hashtags, YouTubers or Discord.

- Adults: this is where some of them may use a forum, but you have to be lucky enough to find adults who have been using the internet for many years and know what a forum is. If you find a 30- or 40-year-old who started using the internet 5 years ago (which happens), you are lost.

And on top of that, you need to compete against Reddit and their own subreddits.

(Edited to format)


The value of a forum is not to get everyone. It is actually in limiting a community to the most mature voices and thoughtful people.

There's little upside to having teenagers in a forum, for example, unless you're looking to monetize.


Younger people can help evolve topics by providing a set of fresh eyes. It's also important to pass knowledge on, otherwise it just ends up dying with you.

For example: Virtual reality headsets were fairly stagnant until a young guy in his early 20s tried something new.

Old people often become stuck in their ways and it requires someone new to show up and ask how the sausage is made.


"fresh eyes" can still be the people new to the form who have that year just turned 35 (eg).


I was moderating forums when I was twelve. While I experience the usual “I was so embarrassing back then” when thinking back to those times, I do think my contributions were positive and well-received.


I used many old school forums in the pre-social media days. "Mature voices and thoughtful people" were quite rare.


If you think young people won't use forums, you clearly haven't seen Scratch's. (https://scratch.mit.edu/discuss/)


>- Young people: hard to get them to use the browser; if there isn't an app (Instagram, TikTok), you are lost. If they want to discuss a topic, it's mostly Twitter hashtags, YouTubers or Discord.

I disagree with this take.

GenZ can do long-form discussion and they use forums frequently, but for specific discussion(s).

Content consumption is done via forever-scroll apps because it's good to kill time; Tiktok, Instagram, etc match this well because it's just a stream of things to have fun with.

But GenZ is making great long-form content on many traditional platforms, even blogging. The difference, as I understand it, is their approach to engagement. GenZ has seen Facebook arguments and said "no thank you". Twitter discourse also isn't really a thing -- people disagree and tweet at people, but Twitter isn't like a forum thread, and there's no way to ensure your content is associated with the content you want to respond to, so it's not effective to communicate in responses. Reddit is kind of a mystery to me as I just don't frequent it at all, but I don't get the impression that GenZ is posting frequently.

GenZ has platforms that work best when you make a statement, not where you open a discourse; Twitter is too fast/broad to respond to all comments and find real content to respond to, and video content just isn't great for back and forth and becomes time-consuming for lighter topics. Viewers will make whatever statement they want, but the validity of the video is judged by how far the concept spreads; I'd actually argue GenZ is very good at concisely expressing an idea in a simple, condensed format, and responses are aimed not at an individual but at an idea. But they will go to Tumblr/Medium/other long-form posts when the medium is appropriate.

This is one thing I like about a lot of GenZ content: they tend to be VERY good about choosing the right format for their argument; that many don't have much more to express besides tweets/tiktoks isn't an indictment of GenZ, it's praise. Think of the forums you maybe still lurk around and how many comments are just complete garbage/non-sequiturs; on forums such posts are enough to derail a topic or distract because we feel obligated to respond to some degree, but filtering out noise is part of the skill of using more modern platforms.

Forums are kind of contrary to this, and are also bogged down by the aforementioned Facebook-argument issue and by the preferred platforms not really being strongest for direct rebuttals. Again, there are times when you'll see GenZ use forums or other long-form posts, but it tends to be more controlled, or on 'forums-but-not-really' forums like Tumblr.

Forums have their purpose and use, but we have __many__ alternatives that make them redundant, as some topics/ideas are far better expressed on TikTok or Twitter and whatever follow-up we end up with.


this vaguely aligns with my experiences


I am building a project (https://linklonk.com) that does information discovery in a way that respects your attention. In short, when you upvote content you connect to other people who upvoted that content and to the feeds that posted it. So to get your attention, other users need to prove to be good curators of content for you.
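
To make that concrete, here is a toy relational sketch of the idea (simplified and purely illustrative, not the actual implementation; table and column names are made up):

    -- Each upvote links a user to an item.
    CREATE TABLE upvotes (
        user_id bigint,
        item_id bigint,
        PRIMARY KEY (user_id, item_id)
    );

    -- Candidate items for user 42, ranked by how many "curators"
    -- (users who share upvotes with user 42) also upvoted them.
    SELECT candidate.item_id,
           count(DISTINCT curator.user_id) AS curator_score
    FROM upvotes AS mine
    JOIN upvotes AS curator
      ON curator.item_id = mine.item_id
     AND curator.user_id <> mine.user_id
    JOIN upvotes AS candidate
      ON candidate.user_id = curator.user_id
    WHERE mine.user_id = 42
      AND candidate.item_id NOT IN (SELECT item_id FROM upvotes WHERE user_id = 42)
    GROUP BY candidate.item_id
    ORDER BY curator_score DESC;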

I'm planning to do a "Show HN" post next week for it and would appreciate any feedback that I could address before it. We have about 4 active users and a couple more would be great.


Really like this idea, something I've been thinking about for a while. Will join and give the tyres a kick :-)


Thanks! It looks like 13 people signed up, which is really encouraging.

I wrote about the performance tuning I did in preparation for the Show HN post: https://linklonk.com/item/277645707356438528

If you have any feedback please add a comment to that post.


Hey, site down?


Yes, I managed to delete the "A" DNS record last night when I was adding records for mail hosting linklonk.com. Sorry! It is up now.



Hmm... did you happen to work at Google? The concept and the UI seems very familiar.


Bug report: the UI on interacting with posts in the “From feeds and users that recommended this” section is broken.


Thank you! Indeed, the upvote button on those item-based recommendations didn't work. It is fixed now.


Seems good but the UI can be improved IMO.

I have two questions please:

- Is there a list of all the feeds used by LinkLonk?

- How many users are there today?


Do you have any specific suggestions on how to improve the UI? Small tweaks would be most appreciated, as I could implement them before the upcoming Show HN post.

To answer your questions:

1. The list of all feeds used by LinkLonk is not publicly accessible through the website. They are feeds that users explicitly submitted through https://linklonk.com/submit or feeds that LinkLonk parsed from the meta tags of the links that users submitted.

2. The number of active users has been about 4 for the last few months. I'm hoping to get it to 10 this year.


When and how are you planning to change your Terms of Use / Privacy Policy? That kind of information should be in the documents. Your privacy policy is currently insufficient.

> We only collect your information for the purpose of providing this service.

Okay, but what information do you collect? “Your information” is too broad; if you're collecting my retinal scans or matching me to a behavioural analytics profile, you need to justify it. If you're not (which you're not), say so!

> You can delete your account and all your data at any time (see Profile).

But can I take all my data out? Currently, no; that's not a GDPR violation, so long as you provide it on request, but it's certainly a feature I like to have!

The 30-day deletion threshold for anonymous accounts is (IANALaTINLA) GDPR-compliant, since you have to delete personal information (on request) within 30 days, and that happens even if you can't figure out whose account details you should be deleting. Good job.


I share your concerns about user privacy. The information LinkLonk collects is what you explicitly provide (ratings, etc) and the regular server request logs (which include your ip address, user agent). I clarified this in https://linklonk.com/privacy

I do want to add functionality to download your ratings. I'm thinking of exporting the ratings data in either the bookmarks format (ie, the format that browsers use to export bookmarks: https://support.mozilla.org/en-US/questions/1319392), csv or json. Please let me know what format would be most useful.


Yeah, that's great now. (You're collecting even less than I expected!) Thanks for making LinkLonk.

JSON would probably be easiest to start with, because it's easy to generate, easy to read and well-defined. Bookmarks would be a nice extra, though.


Looking forward to see where you go. I used forums primarily for my automotive hobbies, but they all seemed to have died around 2013-2015 as Facebook groups took over. Still, forums are often the best place to find good information. I worry about the day that Facebook solves the search and weekly "repeat topic" problems that are the only thing holding it back.

It's a shame phpBB, vBulletin and other big players in the space were too slow to adapt to mobile.


> It's a shame phpBB, vBulletin and other big players in the space were too slow to adapt to mobile.

Was that a problem? My memory is that everything used 'Tapatalk' for mobile, perhaps before the first iPhone even (I recall using it on an iPod Touch).


I used to volunteer my time to a very major forum software project. Tapatalk at the time had a very strange business model: a plug-in/add-on that was free to the forum owner, while charging for the app that the end user used. Even at the time, this ran backwards from the prevailing user-interaction model of offering the path of least resistance to engagement. It was unpopular among forum admins, who would rather have bought a license for it like you would with vBulletin, Xenforo or Invision Power Board - including the people who ran open source ones like phpBB, SMF, et al.

While I understand Tapatalk has changed its business model since then, the damage was already done as Facebook started to wholesale eat forums' lunch in terms of user base. The biggest problem is that we never turned forum interactions into a protocol like we did at the application level (SMTP/POP, HTTP, IRC, XMPP) or on top of HTTP with RSS, podcasts, or just plain, standardized REST APIs. That could have enabled multiple clients (like browsers) to appear and might have prevented Facebook's swift dominance over online communities.

Everyone wanted to own their forum's experience, but this stubbornness made the friction for users to sign up greater and greater. Platforms like Disqus attempted to solve this by creating an embeddable service to just drop comments into a context like a blog post, but this ultimately gave users almost no value if they were just in a shouting match against bots with generic, spam-laden messages.

Facebook unified the experience for users: with an account and app they already had, a user could browse and join groups, engage in discussions and become a part of communities in a way that forums could not possibly compete with.


Oh yes! I'd forgotten that aspect. Was there not anything you could do for free?

I wasn't involved with the hosting/software/ops etc. side of it at all, but I moderated 'The Computer Forum', latterly 'Computer Juice', and used it mainly with that. I fondly remember wasting an awful lot of time helping people solve Windows problems (haven't used it since.. not saying that's related..) and spec new builds.

I suppose that's all happening on StackExchange and probably some DIY custom pc Discord server or whatever these days.


It was, and it was pretty terrible.


I have spent a good bit of time thinking about this too, on two primary fronts.

First, in my mind the difference between Reddit, FB, Twitter, HN, forums, etc. is really just configuration. Abstracting just a tad higher, you can include Slack and other realtime options. I just want a curated gRPC API that implements it with pluggable auth and lets others figure out discoverability and network (not an activity API and not with an already built network, just persistence and auth). End-to-end encryption is important IMO too (even for large groups) so the host can have plausible deniability.

Second, you have to solve hosting/network in a distributed fashion without over-complicating server democratization and discoverability/naming the way p2p often does. I see two options: 1) self-hosted on an at-home workstation using a Tor onion service (you get NAT busting for free), knowing you need an offline-friendly implementation, or 2) one-click easy reselling of cloud instances and domains from inside the app (this also provides a funding model).

I know of many p2p options for solving these problems, but I don't think we need to complicate things at that level. As for the quality of the communities themselves, a self-hosted megaphone instead of the perverse share/like broadcast incentives of today's companies will automatically improve discourse (at the potential cost of creating echo chambers).


> First, in my mind the difference between reddit, FB, twitter, HN, forums, etc is really just configuration.

The big boys are dopamine reinforced network effects incorporated by means of software tech. So forget about aiming for Homer Simpson.

You'll have to be happy with a small, but productive minority. Enough valuable people would rather die than use FB for niche purpose X. Start by convincing them...


A bit late to the party but I've been thinking about this too.

One of the main problems with platforms like FB and Reddit is that the posts/discussions are short-lived. They bubble up to the top of the feed when they're fresh and active but then die off and are replaced by the next thing that craves your attention.

Forum posts are sorted in chronological order, grouped into categories. Browsing through the feed you can see what topics are being actively discussed, or you can search for past discussions on a topic you like, resurrect an old thread if you find a good one, and it gets a new life. I like this model.

One concept I've thought about is something like Reddit, where anyone can create and manage subs, but without the same kind of karma whoring and short-attention-span issues, i.e. posts/threads would live forever and not get locked after 6 months or whatever Reddit does, and posts would be sorted by activity rather than ADD points. I've found so many good X-year-old Reddit threads with super interesting discussions which I would've liked to jump in and resurrect but can't.

Of course an immediate problem that comes to mind is spam. If posts with new comments are lifted to the top it invites spam bumps and thus moderation. And/or it could be combined with some sort of karma system (reputation, account age etc).

And, since this wouldn't be a specialized forum, you'd need to make sure you could cater to various different kinds of communities, i.e. have good multimedia sharing capabilities (for those communities only wanting to share images/videos/memes), code formatting and syntax highlighting, maybe LaTeX, etc...


I saw a great blog once where the author was writing a book in the first post. All subsequent posts were dressed-up changelogs with many interesting and useful comments.

You don't want the notification factory, but if it's a single post that gets updated you don't get any feed updates at all. Reading the same book again and again looking for the updated sections is also not much fun. Dropping a comment into the long list of comments under it doesn't really create a discussion (especially not without feed updates).

You could design the publishing tools so that they "force" the user into that pattern. People could work on multiple books or long reads but start with a crappy draft or just a bunch of links.


I'm writing a crappy book, and was thinking of using Pijul (https://pijul.org/) to do exactly this! I'd like a better solution, though.


The people I want to talk to are on facebook groups (my hobbies seem to be "old people" hobbies). I think you're suffering from network effects.

Sooooo..... StackOverflow solved it by starting with a vertical the founders had a lot of social juice in, and spreading into other verticals. Possibly also by focusing very tightly on "questions and answers".

So my suggestion is "overfocus". If the big platforms have a weakness, it's that they're generic one-size-fits-all solutions. Solve one problem (Q&A, show off my project, discussion, news aggregator) for one vertical really well, then expand. An example off the top of my head might be collaborative note-taking for a class. 6510's "write a book in public" platform is also a fantastic idea at first glance.

(But skip the distributed bit - customers don't care about your architecture, and whatever USP a distributed platform has can be emulated by a centralised platform. Centralised always wins).


Great point about starting with a specific vertical. Creator communities (youtubers etc) is an area I had been thinking to focus on, though this space is mostly dominated by Discord at the moment.

> Centralised always wins

My dream isn’t necessarily to win in a financial/monopolistic sense, but rather to build a compelling enough alternative to the centralised systems that have lost their way thanks to incentives that aren’t aligned with the community.

Facebook, reddit, disqus all started out with good intentions to connect people, but have been slowly eroded by incentives to suck user attention.

So it may not be the best business strategy, but I think such software should live or die on whether the community enjoys using it and is willing to (financially) support its continued existence, rather than how much attention can be siphoned into ads.

In other words, small niche communities where a few members don’t mind contributing financially rather than huge communities that rely on network effects and centralisation.


I've thought about this for a couple of days, and I want you to understand I'm coming from a place of kindness - I want you to succeed.

Ok, so. The problem you're trying to solve is "build a community that is prepared to contribute financially to the running of the site" (correct me if I'm wrong).

Distributed is one possible solution to this problem. You're in love with that solution and it's time to murder your darlings. Sit down, brainstorm five other possible solutions, and honestly assess which one solves your problem best.

I think you'll have a hard time beating a subscription model.


I worked in forums for 4 years. It is the worst, most unprofitable business you can get into. The best you can do is show low-quality ads or use crappy affiliate programs. The more ads used, the less the users like the forum. Forum users love being anonymous, so you provide no value to advertisers that want age, sex and location. Advertisers also HATE forums due to their ad being shown on user-generated content. Drama: there are always personal attacks on moderators, posting of illegal material, threats, police involvement and constant human curation. If you build centralized forum software, every forum owner works day and night to use something cheaper and get away from your control. Don't even bother offering host-your-own software (non-centralized). "OH i know React I can build a vBulletin clone." No. No you cannot. vBulletin has been working on this software for 20 years and they offer it for $300.


One idea is to use some more obscure/techie protocol like Gemini [1] to create a self-selecting group of bloggers that are drawn to and choose to participate in the community, and at the same time keep spammers, commercial interests, and other unwanted influences out.

[1] https://gemini.circumlunar.space/ Earlier discussion at: https://news.ycombinator.com/item?id=23042424


I'm very interested in this space too. I want to start a project that explores different ways to communicate on the web. Facebook and Twitter, in their current state, are not the best way in my opinion.


You may be interested in my open source forum:

https://github.com/ferg1e/peaches-n-stink

It's basically an experimental communication platform. Right now I am building Internet forum style communication but I want to expand to other communication mechanisms later.


I tried this, got quite far. Will go a little further when I have spare cycles for it.

Example: https://www.lfgss.com based on code you'll find on GitHub under microcosm-cc


Reddit is building something like this https://www.reddit.com/community-points/


From your summary I'm not sure I understand what exactly your project sets out to do. People are still able to run their own independent blogs, after all. Are you thinking of blogging federation of some sort?


I think Discord is the modern form of this. It's really a great product.


Isn't Discord just chat (aka IRC)? Every time I try to get on Discord, it seems chaotic and confusing. Like chat, I guess it depends on who the users are at the moment you happen to use it. Forums are a much better approach to information sharing IMO.


Discord is chat in a sense. But community forums are also a form of chat. You shouldn’t confuse a technical implementation (e.g. Discord or IRC) with the end product (e.g. fostering community discussion).


Be sure to add cryptographic signatures on the postings and up/downvotes (also with signatures). Then many people can develop content ranking and blocking, and who knows, someone may get it right.


Did you try Diaspora?


I have posted this here before - hexafarms.com. I am trying to use ML to discover the optimal phenotype for growing plants in vertical indoor farms, to a. have the highest quality produce and b. lower the cost of producing leafy green/med plants, etc., within cities themselves.

Basically, every leafy green (and herbs, and even mushrooms) can grow in a range of climatic conditions (phenotype, roughly), i.e. temperature, humidity, water, CO2 level, pH, light (spectrum, duration and intensity), etc. As you might have seen, there is a rise in indoor vertical farms around the world, but the truth is that 50% of those are not even profitable. My startup wants to discover the optimal parameters for each plant grown in our indoor vertical farm, and eventually I would let our AI system control everything (something like AlphaGo, but for growing plant X - lettuce, kale, chard, ...). Think of it as reinforcement learning with live plants! I am betting on the fact that our startup will discover the 'plant recipes' and figure out the optimal parameters for the produce that we would grow. Then, the goal is that cities can grow food more cheaply, in a more secure and sustainable way than our 'outsourced' approach in the countryside or far-away lands.

So now I have secured some funding to be able to start working on optimizations, but I realized that *hardware* startups are such a different kind of beast (I am a good software product dev though, I think). Honestly, if anyone with experience in hardware-related startups (or experience in the kind of venture I am in) would just want to meet me and advise me, I would take it any day. Being the star of the show, it's hard for me to handle market segmentation, tech dev, team, the next round of funding, the European tech landscape, etc. I am foreseeing so many ways that our decisions can kill my startup; all I need is advice from someone qualified/experienced enough. My email: david[at]hexafarms.com


Reminder to focus on nutritive content, flavour, and crop diversity, not just yield. The past 100 years of industrial scale agriculture, with the singular goal of maximizing yields, has done incredible harm. (This has come up on HN repeatedly, so I trust you've seen it, but it's worth championing)


> incredible harm

I agree that micronutrient content has decreased in the past century. Some might be because of scale, some might be that yield gains are mostly driven by macronutrients and water, not micronutrients, it could be selecting varieties that taste better, or it could be depleting the soil.

That said, the US has an obesity epidemic, so there's no shortage of macronutrients. Micronutrient shortages also seem rare. Scurvy and rickets aren't exactly problems.


This isn’t an answer to your ML question, but it is an answer to your problem.

I heard about a greenhouse company that has programmed their climate control to match “best growing conditions historical weather”. So, they ask local experts what year / location had the best X and then they use that region’s historical weather and replay it in their greenhouse. I thought that was brilliant!

(Just realized this was Kimbal Musk that mentioned this)


When I studied farming back in 1998-1999, we once visited a greenhouse, and one interesting thing I picked up was that through observation some gardeners had realized that by lowering the temperature a bit extra an hour or two before sunrise, they could get their flowers to be more compact instead of stretching.

This had replaced shortening hormones in modern gardening (or at least at that greenhouse, but my understanding is they were just doing the same thing as everyone else).

I guess there is a lot more to learn for those who have scale enough to experiment and patience to follow through.


Hmm,

Sounds similar to what I read a long time ago about a big tomato farm in the Netherlands... Have you tried talking to actual farmers of that produce? Universities? Agricultural faculties do a lot of research in that direction.

Expensive, quickly perishable produce might be able to compete; otherwise I guess the free water and energy from above in "remote" classical farming will be hard to beat.

And my naive guess would be that generating enough data for an ML approach that is ML in more than just name might be somewhat expensive.

This sounds so negative, but this is not my intention... I wish you all the best and hopefully will stumble upon a success story in the future :-)


I know this isn't going to sound as sexy as AlphaGo for plants, but I really think this is a classic multilinear optimization problem once you've properly labeled the data and defined the dynamics between the plants and other organisms (e.g., aquaponics). You're looking to optimize multiple variables across a set of known constraints, and I think if you properly defined these constraints you could save a lot of headache/buildout by leveraging a pre-existing toolset like Excel with the Excel Solver add-in and a couple hundred user-defined functions. We're talking 1% of the work to get something usable and product-market-fitable, with automatic output of graphs, etc., that clients could tune and play with locally without you needing to actually share the secret sauce. Eventually you could switch to Python for something more dynamic/web based.


Yeah, the description made me think "simulated annealing" not "AI". I mean, even genetic algorithms might be overkill here.


I'm not able to help, but you don't have any contact details listed against your profile or in this post. How is anyone able to contact you?

At the very least what's a link to your startup's website?


Sorry I thought my email was on my HN profile. I am sitting behind david[at]hexafarms.com


If you’ve listed it in the email field, that’s accessible to HN admins, but not users.

If you want users to have it from your profile, put it in the “about” field.


It sounds like an interesting project, good luck and I hope someone reaches out!


There's some great research on using evolutionary computation to explore plant growing recipes (light strength, how long to leave the lights on, etc). In one experiment, researchers discovered that basil doesn't need to sleep - it grows best with 24 hours of light per day. Risto Miikkulainen shared the experiment on Lex Fridman's podcast: https://youtu.be/CY_LEa9xQtg?t=27m7s I believe this is the paper describing that experiment: https://journals.plos.org/plosone/article?id=10.1371/journal...


This sort of ML problem is characterized by relatively expensive data labeling. Hence, hiring an expert or a mixture of experts, and modeling the crop responses to their choices, will save you a lot of hill climbing in the wrong part of the decision space.


That sounds awesome. I’d love to work in this field. Any tips on where you learn this stuff? Currently a software dev in crypto.


I think you'd be better off using a Gaussian process than reinforcement learning


You need to sequence the plants otherwise you will waste too much time on tuning hyperparameters.


I'm not sure if this is in the spirit of the thread but I've been working on a way to allow reviews of gameplay in video games. In short, you upload a video of you playing the game and someone who's an expert can review it.

I currently have a UI with the comments down the side of the screen which looks like this:

https://www.volt.school/videos/c980297a-417b-416f-947b-58a70...

This is good because you can easily:

- See all the comments

- Navigate between them

- See replies, etc.

However, it has a huge problem: you have to balance watching the video with reading the comments.

I also have an alternative UI I've been working on which only shows one comment at a time:

https://www.volt.school/videos-v2/c980297a-417b-416f-947b-58...

However, the downside of this is that you can't see all the comments at once. I'm not a UI/UX designer AT ALL, so I'd really appreciate some pointers around how to think about making this better! The original post mentions "close to solving"; I think I am pretty close, but it's still not quite right, and while I'm not out of ideas yet, I'd appreciate feedback if the solution is obvious to someone else.


How about showing the comments directly on the video, at a specific time, and a specific place.

Something like Soundcloud comments but for video.

Asian video platforms used to do that. Here's an example: https://www.youtube.com/watch?v=hOMMQmYwd4I

It's totally crazy but can be made much more coherent. Would be useful to have comments at a specific time and place on the screen just for very accurate pointers/comments.


So maybe the problem with showing all the comments at once is that there are too many, and when showing one at a time they are not shown for long enough.

How about breaking the play into chapters/zones/rooms/segments (whatever makes sense for the game) then showing all the comments for that segment. Once the segment ends, there would be a replay button if they missed anything on the first play while reading comments, and a next segment button to carry on.

Interesting time spans could be marked for slow motion, boring bits played at double time.

There would be high level navigation between segments with thumbnails and comment counts. Buttons to skip between “pivotal moments”, maybe with voting to highlight them.


The default should be to show one comment at a time, because that's convenient and quick to get into, but also with an option (maybe just a scroll down) to view all comments. One, that helps the reviewer get an overall idea of what kind of things the submitter is looking for, if they want that, and two, some submitters are inevitably gonna screw up, posting at the wrong times or asking overall summary questions that should be asked at the end right at the beginning or somesuch. So an All Questions button or similar should be there as an escape hatch, but not the primary UI.


I find that my normal model for reading comments on videos across platforms is to not read them much of the time, but if it's a really interesting video I go look at the comments, and it's OK if the video is, for example, fully minimized or off-screen while I read them.

I don't know how normal my use is though or if that's at all helpful.


Don't have anything to add right now but I like the idea of this thread and would support it becoming a regular thing.


We are having atrocious READ/WRITE latency with our PG database (the API layer is Django REST Framework). The table that is the issue consists of multiple JSON BLOB fields with quite a bit of data - I am convinced these need to be abstracted into their own relational tables. Is this a sound solution? I believe it is the deserialization of large nested JSON BLOBS in these fields that is causing the latency.

Note: this database architecture was created by a contractor. There is no indexing and there are no relations in the current schema - just a single "Videos" table with all metadata stored as Postgres JSON field type blobs.

EDIT: rebuilding the schema from the ground up with 5-6GB of data in the production database (not much, but still at the production level) is a hard sell, but I think it is necessary as we will be scaling enormously very soon. When I say rebuild, I mean a proper relational table layout with indexing, FKs, etc.

EDIT2: to further comment on the current table architecture, we have 3-4 other tables with minimal fields (3-4 Boolean/Char fields) that are relationally linked back to the Videos table via a char field 'video_id', which is unique on the Videos table. Again, not a proper foreign key, so no indexing.


Are you just doing primary key lookups? If so, a new index won’t do much as Postgres already has you covered there.

If you have any foreign key columns, add indexes on them. And if you’re doing any joins, make sure the criteria have indexes.

Similarly, if you’re filtering on any of the nested JSON fields, index them directly.

This alone may be sufficient for your perf problems.

If it isn’t, then here’s some tips for the blobs.

The JSON blobs are likely already being stored in TOAST storage, so moving them to a new table might help (e.g. if you’re blindly selecting all the columns on the table) but won’t do much if you actually need to return the JSON with every query.

If you don’t need to index into the JSON, I’d consider storing them in a blob store (like S3). There are trade offs here, such as your API layer will need to read from multiple data sources, but you’ll get some nice scaling benefits here and your DB will just need to store a reference to the blob.

If your JSON blobs have a schema that you control, deprecate the blobs and break them out into explicit tables with explicit types and a proper normalized schema. Once you’ve got a properly normalized schema, you can opt-in to denormalization as needed (leveraging triggers to invalidate and update them, if needed), but I’m betting you won’t need to do any denorm’ing if you have the correct indexes here.

And since you have an API layer, ideally you’ve also already considered a caching layer in front of your DB calls, if you don’t have one yet.
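
To make the index part concrete, a minimal sketch (the table and column names are guesses based on this thread, not your actual schema):

    -- Index the char "pseudo foreign key" the side tables use to join back to Videos.
    CREATE INDEX video_flags_video_id_idx ON video_flags (video_id);

    -- Expression index on a JSON key you filter on often (assumes jsonb).
    CREATE INDEX videos_sensor_id_idx ON videos ((metadata->>'sensor_id'));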


This is super interesting stuff.

First of all, I think the caching layer (which we currently don’t have) is going to be a necessity in the coming weeks as we scale for an additional project (that will be relying on this architecture)

Second of all, it is just PK lookups. We don’t actually have a single fk (contractor did not set up any relations), which makes me think moving all of this replicated JSON data from fields to tables may help.

The queries that are currently causing issues are not filtering out any data but returning entire records. In ORM terms, it is Video.objects.all(), and from a URL param in our GET to the api, limiting the amount of entries returned. What’s interesting is this latency scales linearly, and at the point we ask for ~50 records we hit the maximum raw memory alloc for PG (1GB) causing the entire app to crash.

The solution you propose for s3 blob store is enormously fascinating. The one thing I’d mention is these JSON fields on the Video table have a defined schema that is replicated for each Video record (this is video/sensor metadata, including stuff like gps coords, temperature, and a lot more).

So retrieving a Video record will retrieve those JSON fields, but not just the values: the entire nested BLOB. And does so for each and every record if we are fetching >1

Would defining this schema with something like Marshmallow/JSON-Schema be a good idea when you mention JSON schemas we control? As well as explicitly migrating those JSON fields to their own tables, replaced with an FK on the Video table?


I do want to emphasize that the S3 approach has a lot of trade offs worth considering. There is something really nice about having all of your data in one place (transactions, backups, indexing, etc... all become trivial), and you lose that with the S3 approach. BUT in a lot of cases, splitting out blobs is fine. Just treat them as immutable, and write them to S3 first before committing your DB transaction to help ensure consistency.

Regarding JSON schema, if you have a Marshmallow schema or similar, yes that’s a wonderful starting point. This should map pretty closely to your DB schema (but may not be 1-to-1, as not every field in your DB will be needed in your API).

I’d suggest avoiding storing JSON at all in the DB unless you’re storing JSON that you don’t control.

For example, if the JSON you’re storing today has a nested object of GPS coords, temperature, etc.. make that an explicit table (or tables) as needed. The benefits are many: indexing the data becomes easier, the data is stored more efficiently, the table will take up less storage, the columns are validated for you, you can choose to return a subset of the data, etc… You will not regret it.
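
A minimal sketch of what that could look like (table and column names are invented; it also assumes a numeric primary key on the videos table, so adjust to your actual data):

    -- One row of sensor metadata per video instead of a nested JSON blob.
    CREATE TABLE video_sensor_metadata (
        video_id    bigint PRIMARY KEY REFERENCES videos (id),
        recorded_at timestamptz,
        gps_lat     numeric,
        gps_lon     numeric,
        temperature numeric
    );

    -- Endpoints can now select only the columns they need:
    SELECT v.id, m.gps_lat, m.gps_lon
    FROM videos v
    JOIN video_sensor_metadata m ON m.video_id = v.id;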


Unrelated to the post, but as you seem well informed in the field: would you agree that if a schema is not likely to change and is controlled, as you put it, there is no reason to attempt to store that data as a denormalized document?

Or at least, as you suggest, if required for performance, the data would still be stored denormalized and, where needed, materialized/document-ized?

At my current company, there seems to be a belief that everything should be moved to Mongo/Cosmos (as a document store) for performance reasons and moved away from SQL Server. But really I think the issue is that the code is using an in-house ORM that requires code generation for schema changes and probably generates less-than-ideal queries.

But then I am also aware of the ease of horizontal scaling with the more NoSQL-oriented products, and I'm trying to be aware of my bias as someone who did not write the original code base.


> would you agree that if a schema is not likely to change and is controlled as you put it, there is no reason to attempt to store that data as denormalized document

As a general rule of thumb, yes. Starting with denormalization often opens you up to all sorts of data consistency issues and data anomalies.

I like how the first sentence of the Wikipedia page on denormalization frames it (https://en.wikipedia.org/wiki/Denormalization):

> Denormalization is a strategy used on a previously-normalized database to increase performance.

The nice thing about starting with a normalized schema and then materializing denormalized views from it is that you always have a reliable source of truth to fall back on (and you'll appreciate that, on a long enough timeline).

You also tend to get better data validation, reference consistency, type checking, and data compactness with a lot less effort. That is, it comes built into the DB rather than introducing some additional framework or serialization library into your application layer.

I guess it's worth noting that denormalized data and document-oriented data aren't strictly the same, but they tend to be used in similar contexts with similar patterns and trade-offs (you could, however, have normalized data stored as documents).

Typically I suggest you start by caching your API responses. Possibly breaking up one API response into multiple cache entries, along what would be document boundaries. Denormalized documents are, in a certain lens, basically cache entries with an infinite TTL... so it's good to just start by thinking of it as a cache. And if you give them a TTL, then at least when you get inconsistencies, or need to make a massive migration, you just have to wait a little bit and the data corrects itself for "free".

Also, there are really great horizontally scalable caching solutions out there and they have very simple interfaces.


Thanks for your response. The comparison between infinite ttl cache entries and a denormalized doc is an insight I can't say I've had before and makes intuitive sense


Doesn't Postgres have a way to index JSONB if needed?


You can index on fields in JSONB, but I don’t believe that’s what the op is solving for here.

In either scenario, I’d still generally encourage avoiding storing JSON(B) unless there isn’t a better alternative. There are a lot of maintenance, size, I/O, and validation disadvantages to using JSON in the DB.
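
For reference, the JSONB indexing mentioned above looks roughly like this (table/column names invented, and it assumes the column is jsonb rather than json):

    -- GIN index over the whole document, useful for containment filters
    -- such as: WHERE metadata @> '{"sensor": "thermal"}'
    CREATE INDEX videos_metadata_gin ON videos USING gin (metadata);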


IMO, a JSON datatype should only be an intermediary step in a relational DB's structure, never the final one. Once you know and have stable columns, unravel the JSON into proper cols with indexing; it should improve the situation.

If you're having issues with 5 GB, you will face exponentially worse problems as it grows, due to the lack of indexing.


Cheers for the response (and affirmation). After some latency profiling I am convinced proper cols with indexing will vastly improve our situation since the queries themselves are very simple.


Depending on how much of the data in your json payload is required, extract data into their own table/cols. And store the full payload in a file system/cloud storage.


Also, there's a way to profile which queries take the longest via the DB itself, and then just run EXPLAIN ANALYZE to figure out what's wrong.
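
In Postgres that roughly means the pg_stat_statements extension (it needs to be listed in shared_preload_libraries) plus EXPLAIN; a sketch, with the table name taken from the thread:

    -- Which statements are eating the most time overall.
    -- (Column names are for PG 13+; older versions use total_time.)
    CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

    SELECT query, calls, total_exec_time, mean_exec_time
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 10;

    -- Then dig into a suspect query:
    EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM videos LIMIT 50;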


You can take an incremental approach and do a proof of concept with the data that you have, so you can justify your move too!


I don't think the latency issues are necessarily related to the poor schema. I'd say to dig into the query planning for your current queries and figure out what's actually slow, since it may not be what you expect.

Rearchitecting the schema might be worth doing. From the technical side, PG is pretty nice about doing transactional schema changes. I'd be more worried about the data though. Are you sure that every single row's Json columns have the keys and value types that you expect? Usually in this type of database, some records will be weird in unexpected ways. You'll need to find and account for them all before you can migrate over to a stricter schema. And do any of them have extra unexpected data?


I had to move a MongoDB database to PG at my new job (an old contractor created the MVP; I was hired to be the "CTO" of this new startup) and I had some problems at first, but after I created the related models and added indexes, everything worked fine.

As someone said, indexes are the best way to speed up lookups. Remember your DB engine does lookups internally even if you are not aware of it (joins, for example), so add indexes to join fields.

Another thing that worked for me (and I don't know if it's your case) was to add trigram text indexes, which make it faster to do a full-text search. Remember, anyway, that adding an index makes searches faster but inserts slower, so be careful if you are inserting a lot of data.
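
In case it helps the OP, the trigram setup is just two statements (the column name here is made up):

    -- Speeds up LIKE/ILIKE '%term%' and similarity searches on a text column.
    CREATE EXTENSION IF NOT EXISTS pg_trgm;
    CREATE INDEX videos_title_trgm_idx ON videos USING gin (title gin_trgm_ops);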


Other tips:

- Change the field type from JSON to JSONB (better storage and the rest) https://www.postgresql.org/docs/13/datatype-json.html

- Learn the built-in JSON functions and see if one of them can replace one you made ad hoc

- Seriously, replace JSON with normal tables for the most common stuff. That alone will speed things up massively. Maybe keep the old JSON around just in case, but remove it once it becomes stale(?)

- Use views. Views let you abstract over your database and change the internals later

- If a big thing is searching and that searching is kind of complex/flexible, add FTS with proper indexing to your JSON, then use it as a first filter layer (see the index sketch at the end of this comment):

    SELECT .. FROM table WHERE id IN (SELECT .. FROM search_table WHERE FTS_query) AND ...other filters
This speeds things up beautifully! (I get sub-second queries!)

- If your queries do heavy calculations and your query planner shows it, consider moving them into a trigger and writing the computed result into a table, then query that table instead. I need to do loan calculations that require sub-second answers and this is how I solve it.

And for your query planner investigation, this handy tool is great:

https://tatiyants.com/pev/#/plans
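
And the FTS index sketch promised above (field names are invented; pick the text search config that matches your data):

    -- Expression index over a text value pulled out of the JSON.
    CREATE INDEX videos_fts_idx ON videos
        USING gin (to_tsvector('english', metadata->>'description'));

    -- First filter layer: the GIN index narrows the candidate rows.
    SELECT id
    FROM videos
    WHERE to_tsvector('english', metadata->>'description')
          @@ plainto_tsquery('english', 'search terms');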


Questions first:

1. What are the CRUD patterns for the "blobby data"?

2. What are the read patterns, and how much data needs to be read?

Until the read/write patterns are properly understood, the following solutions should be considered general guidelines only.

If staying in PG: JSON can be indexed in Postgres. You could also support a hybrid JSON/relational model, giving you the best of both worlds.

Read:

Create views into the JSON schema that model your READ access patterns and expose them as IMMUTABLE relational entities. (Clearly they should be as lightweight as possible.)

Modify:

You can split the JSON blobs into their own skinny tables. This should keep your current semantics and facilitate faster targeted updating.

Big blobby resources such as video/audio should be managed as resources and not junk up your DB

Warning:

Abstracting the model into multiple tables may cause its own issues depending on how you ORM map your entities.

Outside the Box Thinking:

- Extract and transform the data for optimized reading.

- Move to MongoDB or a key-value store.

Conclusion:

What are the update patterns?

- Is only one field being updated?

- What are the inter-dependencies of the data being updated?

How are "update anomalies" minimized?

You will need to create a migration strategy toward a more optimal solution, and you would do well to start abstracting with views. As the data model is improved, this will be a continuous process, and the data model can be "optimized" without disturbing the core infrastructure or requiring rewrites.
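
A minimal sketch of the view idea (names and JSON keys are invented; the point is that callers query the view while the storage underneath is free to change):

    CREATE VIEW video_summary AS
    SELECT id,
           (metadata->>'recorded_at')::timestamptz AS recorded_at,
           (metadata->'gps'->>'lat')::numeric      AS gps_lat,
           (metadata->'gps'->>'lon')::numeric      AS gps_lon,
           (metadata->>'temperature')::numeric     AS temperature
    FROM videos;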


I had this issue at a previous job where we would query an API (AWS, actually) and store the entire response payload. As we started out we would query into the JSONB fields using the JSON operators, but at some point we started to run into performance issues and ended up "lifting" the data we cared about into columns on the same record that stored the JSON.
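
Something like this, if it helps (column and key names are made up):

    -- "Lift" a value you query often out of the JSON into a real column.
    ALTER TABLE videos ADD COLUMN temperature numeric;
    UPDATE videos SET temperature = (payload->>'temperature')::numeric;
    CREATE INDEX videos_temperature_idx ON videos (temperature);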


Bit hard to tell without some idea of the structure of the data, but my experience has been that storing blobs in the database is only a good idea if those objects are completely self-contained, i.e. entire files.

If you write a small program to check the integrity of your blobs, i.e. that the structure of the JSON didn't change over time, you may be able to infer a relational table schema that isolates the bits that really need to be blobs. Leaving it too long invites long-term compatibility issues if somebody changes the structure of your JSON objects.


I think your heart shouldn't quail at the thought of re-schemaing 5-6GB! I'm going to claim that the actual migration will be very quick.


This is an affirmation I've been longing to hear, lol!

I’ve already done the legwork, cloning to the current prod DB locally and playing around with migrations, but the fear of applying anything potentially-production breaking is scary to a dev who has never had to work on a “critical” production system!


I would recommend setting up a staging app with a copy of the production database, testing a migration script there, then running the same script on production once you're confident.


Large blobs are not the use case relational databases are built for - this is the starting point for any such discussion. I have two current projects where I am convincing the app builders (external companies, industry-wide used apps) to change this: keep the relational data in the database and take the blobs out. So far it is going better than expected.


I don't know about PG, but with MariaDB, a nice way to find bottlenecks is to run SHOW FULL PROCESSLIST in a loop and log the output. So you see which queries are actually taking up the most time on the production server.

If you post those queries here, we can probably give tips on how to improve the situation.
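
For Postgres, the rough equivalent (no extension needed) is to poll pg_stat_activity:

    -- Currently running statements, longest-running first.
    SELECT pid, now() - query_start AS runtime, state, query
    FROM pg_stat_activity
    WHERE state <> 'idle'
    ORDER BY runtime DESC;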


Interesting. I believe I noted a similar function in the Postgres docs I was scouring through Friday. I’ll give it a look and see what I can find.

Tangentially related for those who have experience, I am using Django-silk for latency profiling.


Also, never trust ORMs. They make it easier to query, but they do not always output the most optimized queries.


Examine the slow queries with the query planner, don’t spend a bunch of time re-architecting on a hunch until you know for sure why it’s slow!

An hour with the query planner can save you days or weeks of wasted work!


This may already be solved, but one of the last pieces remaining in my quest to be Google-free is an interoperable way to sync map bookmarks (and routes, etc) between different open source mapping apps. I can manually import/export kmz files from Organic Maps and OsmAnd, and store them in a directory synced between different devices with Nextcloud, but there's no automatic way to keep them updated in the apps, and so far I haven't found a great desktop app for managing them either. The holy grail would be to also have them sync in the background to my Garmin Fenix, but I am not aware of a way to sync POIs to a Garmin watch in the background.

Related: I'd love to have an Android app with a shortcut that allows me to quickly translate Google Maps links into coordinates, OSM links or other map links. There is a browser extension that does this on desktop, so if anyone is looking for a low hanging fruit idea for an Android app, this might be a fun idea (if I don't get around to it first).


Have you documented your replacements for various Google technologies somewhere? I'm particularly interested in a good calendar.


I'm using Nextcloud to host my calendar. On my work Mac, I connect to it using Fantastical. On my personal Ubuntu machine I use GNOME Calendar, and on Android I use https://github.com/Etar-Group/Etar-Calendar

Everything is seamless for me, though admittedly I'm not a super heavy calendar user.

I plan to do a write up on my whole Google-free setup, but I haven't done it yet, unfortunately.


Thanks! Do you self-host Nextcloud or did you get an account at one of the providers?


I got a VPS and installed Nextcloud with Docker. I would self-host on my own server, but I'm too nomadic for that at the moment. I think the /e/ foundation has a decent managed Nextcloud setup.


I built a community that aims to keep FOSS projects alive. It's meant to solve the kitchen and egg problem by having as many people and projects as possible sign up, and then any developer who was interested could just automatically get commit permissions to any project.

It's called Code Shelter:

https://www.codeshelter.co/

It's been stalled for a while, so I don't know how viable it is, but I'd appreciate any help.


One thing you could try to solve is coordinated revival of abandoned projects - i.e. extending your model to support unsolicited takeover of projects, in the case of a maintainer's having walked away.

For example, I use a javascript library that's best in class for what it does, and yet hasn't had any real commits from its maintainer since 2016. There are 50 pull requests open, some of which fix significant bugs, or add good new features. There are literally 2000 forks of the library, some of which are published on npm but are themselves unmaintained, and almost none of which link back to the actual fork's code from npm. It's a mess, and I bet it's a situation repeated hundreds of times over.

If you were to figure out a workflow by which a maintenance team could form on your platform, and then a) the existing maintainer is pinged to request that they add the team, falling back to b) making it easy for the new team to fork and adopt existing pull requests while supporting them through initial team-forming by laying out a workflow for assigning needed-roles, then I think you'd have a valuable platform.

The key thing is ensuring there's a large enough team to start, so that yet another fork doesn't die on the vine, so maybe think about a(n old) reddit link type interface where people can link to, vote on, and volunteer for projects, with no work needed until there's critical mass and the platform moves the project forward.


Hmm, that's an interesting idea, thanks. Given that finding one maintainer is already hard, though, I think finding a team would be almost impossible...


With the voting mechanism, you wouldn't necessarily need to form a team all at once. Maybe it takes 6 months for enough people to click the "I'd participate" button on a popular project. Granted, half of them might drop out when the project graduates...but if you can try to stake the ground of "the place to suggest and coordinate forks" then at least people who were interested might find it over time.


Oh hmm, I see how you mean, that's interesting... I'll think about that, thanks!


> Given the high level of trust users and project owners are putting in us, we need our maintainers to already have demonstrated their trustworthiness in the community. As such, we'd like to see any popular project you are an owner/maintainer of, as it would make it easier for us to accept you.

I certainly understand the rationale but doesn't this narrow down the universe of possible maintainers while putting even more load on existing maintainers by expecting them to take on more work?


> kitchen and egg problem

Hadn't heard that malapropism before.


Oof, must have been hungry when I wrote that.


Json diffing.

I haven't found any implementations I'd consider good. The problem as I see it is that there are tree based algorithms like https://webspace.science.uu.nl/~swier004/publications/2019-i... and array based algorithms like, well, text diffing, but nothing good that does both. The tree based approach struggles because there's no obvious way to choose the "good" tree encoding of an array.

I've currently settled on flattening into an array, containing things like String(...) or ArrayStart, and using an array based diffing algorithm on those, but it seems like one could do better.


At the risk of not being helpful: I have some JSON files that are updated weekly that I keep under source control in git. The week-to-week updates are often fairly simple, but git was showing some crazy diffs that I knew were way more complicated than the update. I soon realized that the data provider was not consistently sorting the JSON arrays; when I began sorting the arrays by rowid every time before writing them, the diffs were as straightforward as expected. I think I don't understand the problem you're encountering, because this solution seems too obvious.


The problem is in the syntax of JSON itself. Use JSONL instead


I want to improve parts of online professional networking, specifically to be more about self-mentoring/shared learning, as opposed to sales connections.

This is ever more important with the onset of remote hiring, remote work, and the isolation/depersonalization it brings to newcomers to the industry.

There's also an "evil" momentum in remote hiring -- some companies _need_ asynchronous interviews to support their scaling and operations, and the general perception is that it's impersonal and dehumanizing.

This made me think that if we preemptively answered interview questions publicly, it'd empower job seekers to have a better profile and push back against a dehumanizing step, while allowing non-job-seekers to share the lessons that were important to them.

I've been getting decent feedback on my attempt at a solution, HumblePage (https://humble.page), but the reality is that there's a mental hurdle to putting your honest thoughts out there.


This is a nice idea, to talk about and get thinking about these "soft" questions that people often struggle with.

One piece of feedback about the homepage: show a few examples of how people have answered questions, below the prompt. That's more helpful to get us thinking about our own answers, compared to a blank field. (Also, it's not clear what the percentages are meant to represent there. And I'm guessing the number next to the edit icon shows how many people have answered the question already? May need some UI tweaks on these.)


Your guess is close! It is the total number, and the green/blue ratio indicates how many people answered the prompt publicly vs privately.

My intention was to show the general comfort level of answering the prompt in public. Looking back, I wonder if I was being too quirky.

I’m thinking the same on needing UI tweaks, I’m planning for major rearrangements.

Thank you for your interest. Please feel free to reach out via the contacts if you’d like an invite.


Economically sustainable and ethical monetization of user-generated-content games.

The closest most known example of this kind of game nowadays is Roblox, but I'm thinking of things more like Mario Maker or the older-generation Atmosphir/GameGlobe-likes.

Unlike "modding platforms" or simulators/sandboxes/platforms such as Flight Simulator, VRChat or ARMA, these games' content are created by regular players with no technical skill, which means the game needs to provide both the raw assets from which to build the content, as well as the tool to build that content.

Previous titles tried the premium model (Mario Maker), user-created microtransactions (Roblox) and plain old freemium (Atmosphir and GameGlobe).

I suspect Mario Maker only works because of the immense weight and presence of the franchise.

Roblox's user-created microtransactions (in addition to first-party ones) seem to be working, but they generate strange incentives for creators, which I personally feel taints a lot of the games within it. (The user-generated content basically tends to become the equivalent of shovelware)

GameGlobe failed miserably by applying the microtransaction model to creator assets, which means that to make cool content, creators have to pay as well as spend lots of their time actually building the thing, which means most levels actually published end up being the same default bunch of free assets and limited mechanics.

Atmosphir is a bit closer to me, so I see more nuance in its demise. Long story short: they restricted microtransactions to player customization only, but that didn't seem to be enough to cover the cost of developing the whole game/toolset. They eventually added microtransactions to unlock some player mechanics, which meant that some levels were not functional without owning a specific item.

---

In short, the only things you can effectively monetize are the game itself (premium model) or purely-cosmetic content for players. Therefore, to make the cosmetics worthwhile, the game needs to be primarily multiplayer, which implies a lot more investment in the creator tooling UX, as well as the infrastructure itself. But this also restricts the possibilities for the base game somewhat.


My favorite Microtransaction systems are in Counter-Strike Global Offensive and Planetside 2.

Planetside 2 has very slight pay-to-win mechanics in the form of subscriptions for more XP, but it doesn't feel bad to play without paying.

Counter-Strike, on the other hand (and I think some other Valve games too), has just about the perfect model in my mind. There are no advantages you get by paying, only status. The skins that you can buy and sell look cool but are purely cosmetic.

Even with this in mind people spend quite a lot of money (we're talking hundreds of dollars for one gun skin in some cases). It always seemed like a great way to generate revenue ethically.

One thing I will note with that model is they still have the gambling mechanics with the "crates" that open random skins. You could probably crank it up one more ethical notch by getting rid of those or trying to make them less addictive.


That is indeed my current conclusion of the least-ethically-bad viable monetization. Purely cosmetic.

However, paid status is not really a big hook for single-player games; you need to be able to show it off! This means that the game must be designed primarily around multiplayer interaction, which is fine but greatly limits the kinds of games you can implement with this monetization.


A different but similar in topic problem is running and playing tabletop roleplaying games like Dungeons and Dragons.

The solution is a general-purpose distributed computing platform designed for end-user development.

The closest three things that exist are Google Sheets, replit.com and dndbeyond.com. Replit is too low level, dndbeyond is not powerful enough, sheets are stuck with grid and too clunky for everything else.

Here's a few things the user should be able to do:

1. Design a tabletop roleplaying game system from scratch and automate all the math

2. Write content designed to be used with a system

3. Use systems and content designed by other people, without copy-pasting

4. Modify the system and the content designed by other people for your own purposes

5. Share access to the content in a granular way

Tabletop roleplaying games are unique: they thrive on content that must be created and shared quickly, but that also includes simple yet fully general programming capabilities. Seems like a great place to start making programming as commonplace as literacy.


I'm surprised you didn't mention Core.

https://www.coregames.com/


Yeah the post was getting a bit long, there are a bunch of similar games to the ones listed, I just used one of each to exemplify each strategy.

Core is very similar to Roblox in that the creation tools are rather involved, it tends more to a platform with distinct creator/consumer roles.

There's also the PS4 game Dreams, as well as other integrated-modding initiatives like in Krunker.


It's a Roblox copy, and as such is already mentioned.


These are statistics/math problems that 2 medical professionals I'm seeing are working on, not my own work. But they got me curious. FWIW I worked in "data science" as a software engineer for many years, and did some machine learning, so I have some adjacent experience, but I could use help conceptualizing the problems.

Does anyone know of any books or surveys about statistics and medicine, or specifically mechanics of the human body?

- One therapist is taking measurements of, say, your arm motion and making inferences about the motion of other muscles. He does this very intuitively but wants to encode the knowledge in software.

- The other one has an oral appliance that has a few parameters that need to be adjusted for different geometries of the mouth and airway.

The problems aren't that well posed which is why I'm looking for pointers to related materials, rather than specific solutions (although any ideas are welcome). I appreciate replies here, and my e-mail is in my profile. I asked a former colleague with a Ph.D. and biostats and he didn't really know. Although I guess biostats is often specifically related to genetics? Or epidemiology?

I guess the other thing this makes me think of is software for Invisalign or Lasik, etc. In fact I remember a former co-worker 10 years ago had actually worked on Lasik control software. If anyone has pointers to knowledge in these areas I'm interested.


> One therapist is taking measurements of say your arm motion and making inferences about the motion of other muscles.

This seems like a sequential Bayesian filtering problem. Probably high enough dimension that you should just use a particle filter. The big seminal background text in this area is Bishop: Pattern Recognition and Machine Learning.

If the "motion of other muscles" is inferring pose, you could also look into what computer graphics calls inverse kinematics (a typical IK model has a number of dimensions that could fit into a particle filter). There's some more in-depth stuff in motion planning that actually takes into account muscle capability. But I wouldn't know where to find info on that, short of watching the last several years of Siggraph Technical Papers Trailers, grabbing all the motion planning ones, then reading everything they cite.


Thanks, I will follow these references.

I've heard of inverse kinematics but I think it's more focused on "modeling" than statistics/probability? That is, you would have to model each muscle?

I think he is doing something that is more "invariant" across human variation? (strength, body dimensions, age, etc.) I'm not sure which is why my question was vague, but this is helpful.


Yeah, IK is about pose and motion modeling. But you can put any state+motion model inside sequential Bayes, and get the probability that the model is in a particular configuration out.

Hard to know whether that's relevant without knowing what he's trying to predict though.


My research specialty is in orthopedic biomechanics. For the arm motion thing, it sounds like you might want inverse kinematics or inverse dynamics. Take a look at OpenSim: https://simtk.org/projects/opensim

For the oral appliance adjustment, I'm not sure what your output measures of interest are. If they're mechanical maybe you want to do a sensitivity analysis using FEA. Maybe look at FEBio: https://febio.org/

As for books or surveys, biomechanics is huge topic so I'm not sure what to recommend without wasting your time. If you're still defining the problem, maybe run some searches on Pubmed with the "review" and "free full text" boxes checked, and browse the results until you find which sub-sub-topic is relevant to you?

https://pubmed.ncbi.nlm.nih.gov/?term=biomechanics&filter=si...

If no one on the team knows statics, dynamics, and (if you're considering internal strain and stress) continuum mechanics, consider finding a mechanical engineer to help.


Thank you for the references, I will follow these!

I think the basic idea is that when you're doing physical therapy that targets certain muscles, you have to find the muscle(s) that are limiting the motion! This is not obvious because they all interact.

Like if you have a back problem, you can try to exercise your back all you want, and that may not actually fix the problem. Because the real issue could be with your leg, which causes 16 hours a day of "bad" motion against your pelvis, which in turn messes up your back.

All the muscles in the body are interlinked and they often compensate for each other. When people have a problem in one area, they compensate in other ways.

So I have the same question as above: I think inverse kinematics is more about "modeling"? You would need to model every muscle, which is hard, and it is specific to a person?

I think his intuition is partly based on a mental model, but it's also probabilistic. I think the model has to capture the things that are "invariant" across humans (i.e. basic knowledge of anatomy), and the variation between humans is the probabilistic part. It's also based on variation in your personal health history / observed behavior, e.g. how you walk, how often you're sitting at a computer, etc.

So it does feel like an "inference" problem in that sense -- many factors/observations that result in multiple weighted guesses of the cause / effective therapies.


Inverse kinematics is about reconstructing body motion from position marker data, not really about modeling. For example, glue some tennis balls to a person's arms and legs, track their position from video of the person walking around, and use inverse kinematics to reconstruct their joint angles (their skeletal pose) across time. It's also possible to do this with marker-free methods.

Inverse dynamics takes the kinematics data from above and, in combination with ground reaction forces measured from a force plate (or instrumented footwear, etc.), calculates the forces and moments on each joint. Since control of the human musculoskeletal system is over-determined (the same motion, forces, and moments can be produced by multiple muscle activation patterns), EEG data or even ultrasound elastography is sometimes used to better constrain estimates of muscle activation patterns.

In your example the usual approach would be to use (elements of) the above methods to find out if a patient had unusual motion patterns, like the suspected abnormal leg motion in your back pain patient. Statistics comes into play once you have population data to classify as "good" or "bad", and when you're trying to determine if the hypothesized relationships between symptoms and particular motion / muscle activation patterns genuinely exist. Of course, it's fine to try different approaches (but don't forget to obtain IRB review and comply with the various regulations on human subjects research).


I can't help you conceptualize these specific problems but having worked on similar problems in the past I'd advise you to look into ordinary differential equations applied to those systems. They're used a lot for modelling in medical science and even if you're not interested in the dynamics of it they might lead you to the relevant literature for your problems and will address the parameters you're interested in and how they relate to each others.


It sounds vaguely related to "system identification"?


I am blocked on finding a good (defined below) way to determine whether a product description A and product description B refer to the same product.

Imagine that a product description is a n-dimensional vector like:

  ( manufacturerName, modelName, width, height, length, color, ...)
Now imagine you have a file with m such vectors (where m is in millions), and that not all fields in the vectors are reliable info (typos, missing info, plain wrong, etc).

What is a good way to determine which product descriptions refer to the same product?

Is this even a good approach? What is state of the art? Are there simpler ways?

Here is what I mean by good:

  - robust to typos, missing info, wrong info
  - efficient since both m and n are large
  - updateable (e.g. if classification was done, and 10k new descriptions are added, how to efficiently update and avoid full recomputation)


I have worked on this problem many times, at many companies. I am working on it again, actually. Usually some combination of scoring and persisting results in CSVs for human review.

(edit: I am at a desktop now and I can say a bit more)

Here is the process in a nutshell:

1. Create a fast hashing algorithm to find rows that might be dups. It needs to be fast because you have lots of rows. This is where SimHash, MinHash, etc. come into play. I've had good luck using simhash(name) and persisting it. Unfortunately you need to measure the hamming distance between simhashes to calculate a similarity score. This can be slow depending on your approach.

2. Create a slower scoring algorithm that measures the similarity between two rows. Think about a weighted average of diffs, where you pick the weights based on your intuition about the fields. In your case you have handy discrete fields, so this won't be too hard. The hardest field is name. Start with something simple and improve it over time. Blank fields can be scored as 0.5, meaning "unknown". Hashing photos can help here too.

3. Use (1) to find things that might be dups, then score them with (2). Dump your potential dups to a CSV for human review. As another poster indicated, I've found human review to be essential. It's easy for a human to see that "Super Mario 2" and "Super Mario 3" are very different.

4. Parse your CSV to resolve the dups as you see fit.

Have fun!
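To make steps 1 and 2 a bit more concrete, here's a rough sketch (field names, weights and the exact-match scoring are just placeholders; real per-field scorers get much fancier):

    import hashlib

    def simhash(text, bits=64):
        # Bit-voting simhash over whitespace tokens.
        votes = [0] * bits
        for token in text.lower().split():
            h = int(hashlib.md5(token.encode()).hexdigest(), 16)
            for i in range(bits):
                votes[i] += 1 if (h >> i) & 1 else -1
        return sum(1 << i for i, v in enumerate(votes) if v > 0)

    def hamming(a, b):
        return bin(a ^ b).count("1")

    def score(row_a, row_b, weights):
        # Weighted average of per-field similarities; blanks score 0.5 ("unknown").
        total = weight_sum = 0.0
        for field, w in weights.items():
            a, b = row_a.get(field), row_b.get(field)
            if not a or not b:
                s = 0.5
            else:
                s = 1.0 if str(a).strip().lower() == str(b).strip().lower() else 0.0
            total += w * s
            weight_sum += w
        return total / weight_sum

    weights = {"manufacturerName": 3, "modelName": 3, "color": 1, "width": 1}
    # Step 3: pairs whose simhash(name) is within a few bits are candidates,
    # then score() decides what goes into the CSV for human review.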


With regards to 1, I wonder: why would calculating the Hamming distance be slow? In python you can easily do it like this:

    hamming_dist = bin(a^b).count("1")
It relies on string operations, but takes ~1 microsecond on an old i5 7200U to compare 32-bit numbers. In Python 3.10 we'll get int.bit_count() to get the same result without having to do these kinds of things (and a ~6x speedup on the operation, but I suspect the XOR and integer handling of Python might already be a large part of the running time for this calculation).

If you need to go faster, you can basically pull hamming distance with just two assembly instructions: XOR and POPCNT. I haven't gone so low level for a long time, but you should be able to get into the nanosecond speed range using those.
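If you're in numpy anyway, you can also compare one simhash against millions at once without a Python loop (a rough sketch, assuming 64-bit hashes):

    import numpy as np

    def hamming_many(query, hashes):
        # XOR against every stored hash, then a byte-wise popcount.
        x = np.bitwise_xor(hashes, np.uint64(query))
        return np.unpackbits(x.view(np.uint8).reshape(len(hashes), 8), axis=1).sum(axis=1)

    rng = np.random.default_rng(0)
    hashes = rng.integers(0, 1 << 62, size=1_000_000, dtype=np.uint64)
    dists = hamming_many(hashes[0], hashes)   # distances to all 1M hashes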


What's your cost matrix? How much does a false positive hurt? False negative?

I built a commercial system like that for Thermo Fisher, except their descriptions were encoded as natural language text on input, not vectors (for an extra complication).

Some observations:

1. Crude methods based on vector embeddings, cosine similarity, Levenshtein, etc – don't work, if you care at all about false positives.

I see sibling comments recommend this, but it's clear this cannot work if you think about it. Values like "black" and "white", or "I" and "II" (part numbers), "with" and "without", are typically close together in such crude representations, but may lead to products that are not interchangeable.

2. A hybrid approach worked. The SW produced suggestions for which products might be duplicates (along with a soft confidence score), then let a human domain expert accept / reject these suggestions. It also learned from these expert decisions as it went, to save human time.

What I quickly learned is that even as a human (programmer with a PhD in ML), I could not look at two product descriptions and make the decision myself. Are these the same product or not? One word, even one letter, could be absolutely vital. Or absolutely irrelevant. Sometimes even the same attribute / word, depending on the product category.

Hence the final interactive solution with a domain expert in the middle. It worked well and saved time, rather clever, but not in the "hooray NN training" way. A lot of work went into normalizing the surface features intelligently based on context: units, hyphens / tokenization, typos…, because that's a mess in product sheets. The "fancy" downstream ML and clustering part was relatively simple by comparison.

But YMMV, the Thermo Fisher products were fairly specialized and sophisticated (in their millions).


Usually, I do this sort of thing somewhat manually, building up an algorithm (mostly classical, with a little ML as a treat) that can deal with the problem.

I'd start by detecting common typos. Typos are similar to un-typo'd data, so I'd do a frequency analysis on the textual representations of manufacturer name and model name, and a Levenshtein distance calculation, then synonymise the obvious synonyms (looking things up when I wasn't sure). The key idea is that you have access to more information than just this dataset: Tony and Tomy are different manufacturers, but Sony and Somy aren't (even though somy is in the dictionary and tomy isn't).
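A rough sketch of that normalization pass (thresholds are illustrative, and I'd still eyeball the resulting mapping by hand):

    from collections import Counter
    from difflib import get_close_matches

    def canonicalize(names, min_count=50, cutoff=0.85):
        # Map rare spellings onto frequent ones that are textually very close.
        counts = Counter(n.strip().lower() for n in names)
        canonical = [n for n, c in counts.most_common() if c >= min_count]
        mapping = {}
        for name, count in counts.items():
            if count >= min_count:
                continue
            match = get_close_matches(name, canonical, n=1, cutoff=cutoff)
            if match:
                # e.g. "somy" -> "sony"; pairs like "tomy"/"tony" still need the
                # manual lookup described above, since text distance can't tell.
                mapping[name] = match[0]
        return mapping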

Once the manufacturer and model fields are mostly typo-free (after typo replacement – don't modify the original file, if you can help it!), you can start looking at dimensions and colour. Sort by manufacturer, and start de-duping entries. Once you get a feel for the process you're doing (e.g. under what circumstances do you check whether there's a 102mm Phillips screwthread?), you can start automating bits of it. There will always be special-cases, but your job is to get the data processed correctly, not to get the computer to process the data.

Accidentally aliasing two different products is much worse than leaving the same product described twice, so err on the side of “these are different”. (Keep in mind that manufacturers of some things, e.g. SD cards, often pretend two different products are the same – so you can't always win!) Remember, humans exist: bothering them a few million times is a problem, but bothering them a few hundred would be okay.

When new data comes in, I'd run all the code I used to come up with my system, and see if the output was notably different. If it was, I'd get the computer to let me know.

I'd also add some way for users to flag duplicates. Many humans make light work.


You could generate word embeddings for all natural language text fields and then do cosine similarity?


Use a Minhash-LSH ensemble with pre-processing on the words to fix typos via Levenshtein. Tune parameters to get the best distance
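A rough sketch with the datasketch library (threshold and num_perm are illustrative, and the product names are made up):

    from datasketch import MinHash, MinHashLSH

    def minhash_of(text, num_perm=128):
        m = MinHash(num_perm=num_perm)
        for token in text.lower().split():
            m.update(token.encode("utf8"))
        return m

    lsh = MinHashLSH(threshold=0.8, num_perm=128)
    products = {"p1": "Sony WH-1000XM4 black", "p2": "Somy WH1000XM4 Black"}
    for key, name in products.items():
        lsh.insert(key, minhash_of(name))

    # Candidate duplicates for one record; verify with a stricter scorer afterwards.
    print(lsh.query(minhash_of("Sony WH-1000XM4 black")))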


Definitely some clustering method based on similarity of the vectors (there are many, pick a simple one to start)


I want to make technical recruiters better at their job.

Many sourcers and recruiters don't have a technical background and find it very difficult to hire software engineers, especially in the current labor market which is very tight.

I'm starting off simple: writing recruiting guides from a software engineer's perspective that are easy to understand.

Are there other ways we can make technical recruiters better?


> Are there other ways we can make technical recruiters better?

- list salary range for positions

- emphasize tech stack

- emphasize number of rounds

I’ve wasted my time in the past going through the interview process only to find out the company’s budget for the position was only up to X after getting the offer; a valuable lesson I’ve learned to avoid since then of course.

I also see some recruiters only talking about what the business does … leaving out the tech stack.

If these points are clear and easily visible to recruiting leads they might get higher quality candidates.

Just my two cents. ¯\_(ツ)_/¯


I’ve also been thinking about this problem space. My approach is to help candidates build skills, demonstrate proficiency, in a loop with the recruiters.

Basically, take the whole “how I learned my data science skills” into something that can be done in public.

The recruiters can then see a wide range of examples, and can be better at picking up where people’s strengths and talents are.

(This is focused on analytics)


Frustrated by the degree of manual programming in production metal machining. The industry runs largely on inertia. I would like to resolve this by applying standard optimization algorithms to a set of known machining strategies plus machine, work-holding, material, part and tool inputs. I have already analyzed the problem space to some extent and will be touring a huge production facility next week to better understand best-in-class processes from large established players. I need someone to either wrap existing simulation algorithms (any CAM system) or write enough of one (not that hard: the solution space is extremely multivariate but well understood and well documented) to make it feasible (not too hard for 2.5D machining). You can get as intellectual as you like in the solution, but remember perfect is the enemy of done. The value is huge, and I'm happy to split equity on a new entity if a workable solution for the easier subset of parts emerges in the next few weeks.


We run about 40-50 CNC machines. A lot of our engineering time goes into planning how to machine a component step by step so that it can reach the specified tolerances. Sometimes the required tolerances are at or below the machine's accuracy. Are you going to solve this as well?


I am looking at the low hanging fruit, 80/20 right now.


How to make png encoding much faster? I'm working with large medical images and after a bit of work we can do all the needed processing in under a second (numpy/scipy methods). But then the encoding to png is taking 9-15secs. As a result we have to pre-render all possible configurations and put them on S3 b/c we can't do the processing on demand in a web request.

Is there a way to use multiple threads or GPU to encode pngs? I haven't been able to find anything. The images are 3500x3500px and compress from roughly 50mb to 15mb with maximum compression (so don't say to use lower compression).


I've spent some time on this problem -- classic space vs. time tradeoff. Usually if you're spending a lot of time on PNG encoding, you're spending it compressing the image content. PNG compression uses the DEFLATE format, and many software stacks leverage zlib here. It sounds like you're not simply looking to adjust the compression level (space vs. time balance), so we'll skip that.

Now zlib specifically is focused on correctness and stability, to the point of ignoring some fairly obvious opportunities to improve performance. This has led to frustration, and this frustration has led to performance-focused zlib forks. The guys at AWS published a performance-focused survey [1] of the zlib fork landscape fairly recently. If your stack uses zlib, you may be able to find a way to swap in a different (faster) fork. If your stack does not use zlib, you may at least be able to find a few ideas for next steps.

[1] https://aws.amazon.com/blogs/opensource/improving-zlib-cloud...


I have no experience in PNG encoding, but found https://github.com/brion/mtpng The author mentions "It takes about 1.25s to save a 7680×2160 desktop screenshot PNG on this machine; 0.75s on my faster laptop." which makes me think your slower performance on smaller images comes either from using the max compression setting or from hardware with worse single-threaded performance.

Although these don't directly solve the PNG encoding performance problem, maybe some of these ideas could help?

* if users will be using the app in an environment with plenty of bandwidth and you don't mind paying for server bandwidth, could you serve up PNGs with less compression? Max compression takes 15s and saves 35 MB. If the users have 50 Mbit internet, then it only takes 5.6s to transmit the extra 35 MB, so you could come out 10s ahead by not compressing. (yes, I see your comment about "don't say to use lower compression", but no reason to be killed by compression CPU cost if the bandwidth is available).

* initially show the user a lossy image (could be a downsized png) that can be quickly generated. You could then upgrade to a full quality once you finish encoding the PNG, or if server bandwidth/CPU usage is an issue then you could only upgrade if the user clicks a "high-quality" button or something. If server CPU usage is an issue, the low then high quality approach could let you turn down the compression setting and save some CPU at the cost of bandwidth and user latency.
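A minimal sketch of that low-quality-first idea with Pillow (the scale factor and compression level are just examples):

    import numpy as np
    from PIL import Image

    def fast_preview(arr, scale=4):
        # Downsample and save with a cheap compression level for a quick first paint;
        # the full-quality PNG can be encoded in the background and swapped in later.
        img = Image.fromarray(arr)
        img = img.resize((img.width // scale, img.height // scale))
        img.save("preview.png", compress_level=1)

    fast_preview(np.random.randint(0, 255, (3500, 3500), dtype=np.uint8))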


Are you required to use PNG or could you save the files in an alternative lossless format like TIFF [1]? If you're stuck with PNG, mtpng [2] mentioned earlier seems to be significantly faster with multithreading (>40% reduction in encoding times). If you're publishing for web, TIFF or cwebp might also be possibilities with -mt (multithreading) and -q 25 (lower compression and larger filesize but faster) flags, or an experimental GPU implementation [3].

[1] https://blender.stackexchange.com/questions/148231/what-imag...

[2] https://github.com/brion/mtpng

[3] https://emmaliu.info/15418-Final-Project/


GPGPU is the way to go.

Not terribly hard if you only need 1-2 formats supported, e.g. RGBA8 only. You don't need to port the complete codec, only some initial portion of the pipeline and stream data back from GPUs, the last steps with lossless compression of the stream ain't a good fit for GPUs.

If you want the code to run on a web server, after you debug the encoder your next problem is where to deploy. NVidia Teslas are frickin expensive. If you wanna run on public clouds, I'd consider their VMs with AMD GPUs.


Thanks, I hadn't heard of that and I will look into it. This is a research setting with plenty of hardware we can request and not a huge number of users so that part doesn't worry me.


> This is a research setting with plenty of hardware we can request and not a huge number of users

If you don’t care about cost of ownership, use CUDA. It only runs on nVidia GPUs, but the API is nice. I like it better than vendor-agnostic equivalents like DirectCompute, OpenCL, or Vulkan Compute.


I solved a similar problem last year. As others have said, your bottleneck is the compression scheme that PNG uses. Turning down the level of compression will help. If you can build a custom intermediate format, you'll see huge gains.

Here's what that custom format might look like.

(I'm guessing these images are gray scale, so the "raw" format is uint16 or uint32)

First, take the raw data and delta encode it. This is similar to PNG's concept of "filters" -- little processors that massage the data a bit to make it more compressible. Then, since most of the compression algorithms operate on unsigned ints, you'll need to apply zigzag encoding (this is superior to allowing integer underflow, as benchmarks will show).

Then, take a look at some of the dedicated integer compression algorithms. Examples: FastPFor (or TurboPFor), BP32, snappy, simple8b, and good ol' run length encoding. These are blazing fast compared to gzip.

In my use case, I didn't care how slow compression was, so I wrote an adaptive compressor that would try all compression profiles and select the smallest one.

Of course, benchmark everything.
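For reference, the delta + zigzag step is tiny; a sketch assuming uint16 grayscale frames (the downstream integer codec is whichever of the above you pick):

    import numpy as np

    def delta_zigzag(frame):
        # Row-wise delta encode, then map signed deltas to unsigned ints
        # so integer codecs (FastPFor, simple8b, RLE, ...) can take over.
        d = np.diff(frame.astype(np.int64), axis=1, prepend=0)
        return np.where(d >= 0, 2 * d, -2 * d - 1).astype(np.uint64)

    def undo(zz, dtype=np.uint16):
        z = zz.astype(np.int64)
        d = np.where(z % 2 == 0, z // 2, -(z + 1) // 2)
        return np.cumsum(d, axis=1).astype(dtype)

    frame = np.random.randint(0, 2**16, (3500, 3500), dtype=np.uint16)
    assert np.array_equal(undo(delta_zigzag(frame)), frame)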


> Is there a way to use multiple threads or GPU

Maybe you could write the png without compression, compress chunks of the image in parallel using 7z, then reconstitute and decompress on the client side.


This is on our list of possibilities. It would take a little more time than I'd like to spend on this problem but it would work.


I would also be interested in knowing the answer to this. Currently we use OpenSeadragon to generate a map tiling of whole slide images (~4 GB per image), then stitch together and crop tiles of a particular zoom layer to produce PNGs of the desired resolution.


I'm unsure if this will help, but the new image format JPEG XL (.jxl) is coming soon to replace JPEG. It will have both lossless and lossy modes. It claims to be faster than JPEG.

Another neat feature is that it's designed to be progressive, so you could host a single 10mb original file, and the client can download just the first 1mb (up to the quality they are comfortable with).

Take a look: https://jpegxl.info/


This is a research university that moves very slow, so waiting two years for something better is actually a possibility (and prerendering to S3 works ok for now). I'll keep this bookmarked.


Since this is Python, which encoder are you using? I'd make sure it's in C, not Python. You might also be spending a lot of time converting numpy arrays to Python arrays.


Also check FPGA cards (ask Xilinx, Altera/Intel, ...)


I'm trying to find an agile project management tool that works for us. We run on what many would call Scrum (it's not actually Scrum).

We are on JIRA now, and it’s … JIRA. We tried basically any other tool, including Excel (yes, that is somewhat possible).

My problem generally is that tools are slow, planning is cumbersome, visibility is limited and reporting for clients is often even more limited.

Heck, I’d even write my own tool if I knew it would help others, but I am concerned it’s too close to what we already have for anyone to actually migrate.

You could help me by sharing your thoughts!


I've recently started using ClickUp for managing my helpdesk and development work and I like it a lot. I don't do scrum myself but the product claims to be useful for that kind of work, as well as many other approaches and use cases.

https://clickup.com https://clickup.com/on-demand-demo

ClickUp for Agile Workflows https://www.youtube.com/watch?v=H9hZRwivnL8


Depends on workflow and team size. For small teams good fit could be some kanban based tools, for example Trello or GitHub projects.

You could also try modern agile tools, for example Linear. JIRA is good for 100+ teams and complex architectures.


We use Restyaboard for Agile marketing; you can manage all your projects, teams, and clients from one single space. https://restya.com/board/demo


Is linear.app in the realm of what you're looking for? Have you tried that?

Not affiliated, but I've had a positive experience with it in a small team. I would describe it as an IDE for issues.


Asana works really well for us. Really good UI/UX and fast, which I think is the best feature they have :)


have you tried https://tara.ai/?


Try Asana


I'm working on a different type of compression (for all file types). I am able to get into the 10-20% range, but compression is often too slow, or at other times doesn't complete at all (I've been working on this for years). My personal website: http://danclark.org

I'm also working on a conversational search engine (using NLP) at http://supersmart.ai


Have you looked into Middle Out compression?


Funny, I've actually had a lot of fun working on this compression software. It's a weird mix of needing it to be fast and hitting a compression threshold of it being useful. One of the best projects I've embarked on


We are experiencing very high CPU load caused by tinc [0], which we use to ensure all communication between cloud VMs is encrypted. This is primarily affecting the highest traffic VMs, including the one hosting the master DB.

I am starting to consider alternative tools such as wireguard to reduce load, but I am concerned about adding too much complexity. Tinc's mesh network makes setup and maintenance easy. The wireguard ecosystem seems to be growing very quickly, and it's possible to find tools that aim to simplify its deployment, but it's hard to see which of these tools are here to stay, and which will be replaced in a few months.

What is the best practice, in 2021, to ensure all communication between cloud VMs (even in a private network) is encrypted?

[0]https://www.tinc-vpn.org/


Apart from some smaller projects building on top of WireGuard, there's Tailscale [1]. One of the founders is Brad Fitzpatrick who worked on the Go team at Google before and built memcached and perkeep in the past.

Outside of the WireGuard ecosystem there's ZeroTier [2] which has been around for a while and they're working on a new version; and Nebula [3] from Slack, which is likely to be maintained as long as Slack uses it.

There might be others, but with tinc these four are the ones I've seen referred to most often.

[1] https://tailscale.com

[2] https://www.zerotier.com

[3] https://github.com/slackhq/nebula


+1 for Tailscale, the product is great. I've used it at a very limited scale but can vouch for its quality and performance. No CPU issues at all (even on rPi).


Similar to Tailscale is the Innernet project, which has similar goals but is fully open source (also built on Wireguard). I've heard that set-up is a bit more painful, but for those who are interested in FOSS or self-hosting, it might be worth looking into.

[1] https://github.com/tonarino/innernet


NoCode: fly.io with its 6pn (out-of-the-box private networking among clusters in the same org).

DIY: envoyproxy.io / HashiCorp Consul for app-space private networking over public interfaces.

LowCode: Mesh P2P VPN network among your clusters with FOSS/SaaS like WireTrustee / tailscale.io / Slack Nebula.


What kind of loads are we talking about here? How many requests per seconds? Or is each request response large?

Have you noticed whether it is worse for lots of small requests vs large data transfers?

I use a very similar setup, but haven't seen tinc CPU usage matter yet, though for very low traffic.


There is a juxtaposition in the UK job market. We have millions of people working in low-paid precarious jobs in retail, food service, warehousing etc. while simultaneously companies complain that they cannot recruit into highly-paid, skilled roles due to a lack of candidates.

Given that you can study Introduction to Computer Science from Harvard University, online, for free and in your own time, it seems like the barriers to building skills are lower than ever.

However, many people are put off or intimidated by the idea of studying such a course. My solution to this is some kind of mentoring, either 1-to-1 or more likely in small groups. However, this is very resource intensive, which makes my idea hard to scale. I'd be very interested to hear how others might approach this, both the mentoring and the underlying encouragement to study.


How to find motivation/energy to do a long-term creative project when having a full time jobs + other responsibilities?


The brain hates to start things and loves to finish things. This can be hacked in that a working session should always leave something unfinished.

Say you're writing a novel. Every writing session (but the first, obviously) should cover the end of the last scene and the beginning of a new one, AND THEN STOP, i.e. not finish the new scene.

Your brain will want to come back to the work to finish it, which overcomes the friction of "starting" something new every time.

It's easier said than done. It's surprisingly difficult to leave something unfinished at the end of each work session. But that's the trick.


The brain you describe is not the brain I possess.

Starting is easy.


"starting" is ambiguous. I really meant "getting to work".

Thinking about new things and maybe throwing down a few ideas is indeed easy and pleasant.

But deciding to spend a few hours to move a project (instead of not) is what the brain hates. It hates commitment, and is very afraid of the opportunity cost.


Nah, I'm with parent commenter on this. When I'm excited about something I have no problem diving into it for hours on end. But when I know that something is 90% done and it just needs to be tidied up, I will do anything other than working on it. Either everything from solving the hardest problem to being completely done happens in one sitting, or it never gets finished.


I've adopted this sort of trick by ending my programming session with a failing unit test. It works quite well (when I remember to do it).


Wake up at 0430, exercise, take a shower and then work till you need to get ready for your full time job. If possible also dedicate half of your lunch hour to your project as well, together with half a weekend day.

It's quiet so early in the morning, so your productivity will skyrocket. I've coded my Paras this way while working a full time, heavy blue-collar job.


Does it not cause adverse impact on the day job?


Quite the opposite in my experience


How close are you to solving that ? And please would you share your progress :)


My current strat is to channel my scant motivation into maximizing my sleep and well-being, expecting to squeeze out more motivation from that.


I could use some help with some heuristics for Machine Learning, like how much data do I need to make a workable model, what framework/approach makes more sense given my ultimate goals.

Here's an example: there's a lot of ML tutorials on doing image identification. Like you have a series of images: picture one might have an apple and a pear in it, picture 2 might have an apple, orange, and a banana in it.

Where I'm struggling is putting this into my domain. I have 100k images and from that around 1k distinct labels (individual images can have between 1 and 7 of these different labels), with between 100 and 13,000 example images per label.

Is that enough data? Should I just start working on gathering more? Is this a bad fit for a ML solution?


Hey, I'm a ML practitioner for over 6 years and I'm glad to help.

1k distinct labels with a long-tail distribution, which is what you are describing, is definitely a challenging problem. It's called an imbalanced classification problem.

I'd first focus on testing how well your model is able to predict these classes, doing a stratified cross-validation (stratification controlled by the label class), and measure the F-score, weighted accuracy and ROC AUC. Check also the precision and recall for each class. You'll definitely see that the model predicts better for the labels with more samples. The code that you use here you'll be able to reuse later on, so keep it organized and easy to follow.
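As a rough sketch of that evaluation code with scikit-learn (proper stratification for multi-label data needs something like iterative stratification, which I'm leaving out; the arrays below are random placeholders shaped n_samples x n_labels):

    import numpy as np
    from sklearn.metrics import f1_score, roc_auc_score, precision_recall_fscore_support

    # y_true: binary indicator matrix, y_prob: predicted probabilities per label.
    y_true = np.random.randint(0, 2, (1000, 20))
    y_prob = np.random.rand(1000, 20)
    y_pred = (y_prob > 0.5).astype(int)

    print("weighted F1  :", f1_score(y_true, y_pred, average="weighted", zero_division=0))
    print("macro ROC AUC:", roc_auc_score(y_true, y_prob, average="macro"))

    # Per-label precision/recall makes the long tail visible.
    prec, rec, _, support = precision_recall_fscore_support(y_true, y_pred, zero_division=0)
    for i in np.argsort(support)[:5]:
        print(f"label {i}: precision={prec[i]:.2f} recall={rec[i]:.2f} n={support[i]}")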

Then you have a couple options, focus on gathering more examples for the labels with small sample size, or try to oversample your dataset. This article is a good place to start https://towardsdatascience.com/4-ways-to-improve-class-imbal...

Considering this problem of image classification is normally solved with deep learning, the more data you have, the better will be your results.


This was very helpful, thank you so much!


Computer Vision is the one domain where ML (and neural networks in particular) is the undisputed king. Unless you're in an embedded application where you can't use neural networks in which case you might want to go with handcrafted features to trade-off accuracy for speed/compute-efficiency.

With regards to the size of your dataset, there are no hard rules; it depends on the complexity of the task. Problems with a high number of classes are among the more difficult ones, and 100 samples per class might be enough or it might not. The only way to know for sure is to try and see if you reach a performance that's acceptable for your application.

I recommend the PyTorch framework: it's coherent, easy to use and well documented (both the API reference and the examples available on SO and throughout the net). Your problem is similar to ImageNet (assuming you want to detect the presence of a class and not its position, in which case it's a different problem), so you can try to run one of the PyTorch tutorials and see how well it does. The only difference is that you want to detect multiple classes in one single image, so you'll have to adjust your output layer and potentially the loss, but the network itself could remain the same. You also might want to look into doing transfer learning with a pretrained ImageNet network to speed up the training.
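The output-layer and loss adjustment is small; a rough sketch of that multi-label setup (the hyperparameters are placeholders):

    import torch
    import torch.nn as nn
    from torchvision import models

    num_labels = 1000  # your ~1k distinct labels

    # Pretrained backbone; swap the final layer for multi-label output.
    model = models.resnet50(pretrained=True)
    model.fc = nn.Linear(model.fc.in_features, num_labels)

    # Multi-label: sigmoid per label via BCEWithLogitsLoss (not softmax/CrossEntropy).
    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    def train_step(images, targets):
        # images: (B, 3, 224, 224), targets: (B, num_labels) 0/1 indicator matrix
        optimizer.zero_grad()
        loss = criterion(model(images), targets.float())
        loss.backward()
        optimizer.step()
        return loss.item()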


Define workable model. Do you care more about recall (how many of the images with label X will be labeled by your model with label X) or precision (how many of the images where the model says have label X are actually with label X)?

It is a good fit for ML, but you need to be clear on what the results will be. If you expect 100% accuracy, that won't happen. Even 90%+ accuracy would be require a lot of effort.


If you have not at least skimmed through fast.ai, you should definitely do that, as the course itself addresses some of this and the people on their forums are among the most helpful I have ever seen!

Second, this could be more than enough! Especially if you are doing transfer learning.

Third, you can "inflate" the amount of images you have now with "Image Data Augmentation"


The answer is, "it depends". My recommendation would be to just start trying. What you describe sounds like you could train it in a couple of days. Grab an off-the-shelf resnet and just see what happens. This is a well-studied problem, you can just look at papers that train on imagenet, and then tweak their approaches.


Sounds like a problem which YOLO can solve pretty easily, i.e. object detection and classification, transfer learning, etc. Try downloading a pre-trained model and play around with that. (The other replies have outlined what you need to focus on)


How do we scale social accountability and knowing?

This is expected to enable us to solve distributed coordination problems. Also, it should facilitate richer more meaningful relationships between people.

Expected outcomes include increased thriving and economic productivity.

[edit: consider the limit on how many people you can know and the relationship between how deeply you come into relationship with that population and the size of that number]


I have spent quite a lot of time thinking about coordination in general. Indeed, knowledge is a vital part of it. The problem that I see is that knowledge is too vague and lossy and changing and incomplete [as I mentioned in this comment https://news.ycombinator.com/item?id=26203718].

A hypothetical solution would be a system that spoke a language similar to plain English, but that was deterministic. You let people write their problems and views into the system, and the system determines which is the widest consensus available within a given scope and what the highest-priority problems are (as perceived by people). This has a lot of problems, but it's a good way to think about the topic. Even with such a system, would you really be solving the problems you want to solve?

If it does, then this is basically symbolic AI. You can try to relax requirements... but you kinda need an "automatic coordinator". If you go with a manual coordinator instead, then I doubt you will be able to scale anything that's not extremely rigid and hierarchical, at which point you are re-introducing many of the same problems you were trying to fight in the first place.


A combination of "all categories are fuzzy" and "all models are wrong but some are useful"? I too doubt the effectiveness of a symbolic AI approach. Although I studied that and other approaches in the field, you may note that my background is in biologically plausible methods for pursuing artificial intelligence.

I think the direct human input method is given too much focus although it and related interactions have their place. The fallible sensors directly reporting readings from reality already has sufficient noise related issues. I suspect more richly informing people will yield better results.

I am inspired by stories such as the fish farm pollution problem [0]. Consider how a reality based game theoretic analysis of agent choices might guide your selection of future work mates (or lakes) and facilitate a different friction in finding your next contribution to the world.

[0] search "3. the fish" on https://www.lesswrong.com/posts/TxcRbCYHaeL59aY7E/meditation...


I find your comment quite confusing.

>> A combination of "all categories are fuzzy" and "all models are wrong but some are useful"?

Are you talking about my first paragraph or symbolic AI?

>> The fallible sensors directly reporting readings from reality already has sufficient noise related issues.

I assume here you are trying to say that human input is not reliable.

I don't understand what's your approach with AI here. You seem to want to use it to better inform people? How? You are going to say that human input is not reliable, but then train an AI that can't explain itself and expect people to take its advice? Either noise can be palliated at scale in both places or none.

Finally, I'm very familiar with meditations on moloch. But you seem to be betting on an "education-based" solution, which doesn't fit very well with the scenario that meditations on moloch exposes, which is not that some people couldn't make better choices (for society, the collective), but rather that the "questionable" choices of a few can deeply compromise the game for everyone else. I mean, we all probably agree that it would be great to educate people on these concepts, but I doubt that will be enough to stop the dynamics that cause it.


I apologize for the unintended confusion. I don't find all expression safe in this context and have avoided some of it as well as the amount of work I could put into describing what amounts to a ~36 year life obsession for me.

> Are you talking about my first paragraph or symbolic AI?

In the link you provided and the second paragraph of your first reply you seem, to my reading, to suggest using a system to facilitate discovering agreement on specific actions, knowledge, and tactical choices. Stated differently agreement within groups, perhaps large groups. You discussed in both comments the challenge of being specific and static, which is the downfall, in my opinion, to many symbolic systems - the presumption that our ability to discretely describe reality is sufficient. To me fuzzy categories and useful broken models comment about that finding. The systems you are describing sound useful but seem to solve a different problem than I mean to target.

> I assume here you are trying to say that human input is not reliable.

Yes, I find human output to be unreliable and I believe it is well understood to be so. An example of a system that has elements of scaling social knowing is Facebook. I believe it is well understood that people often (and, statistically speaking, prevalently) present a facsimile of themselves there when they are presenting anything actually more than superficially adjacent to themselves at all. This introduces varying amounts of noise into the signal and displaces participation in life, perhaps in exchange for reduced communication overhead. Humans additionally make errors on the regular, whether through "fat fingers", an unexamined self, "bias", or whatever. See also "Nosedive" [0].

> I don't understand what's your approach with AI here

I haven't really described it - the ask was literally for the problem, not for solutions. There is a certain level of vaporware in my latest notion for exactly how to solve it. As stated obliquely however, there are aspects of the solution that I don't really want to be dragged through a discussion on here on HN.

> an AI that can't explain itself

I haven't specified unexplainable AI. I actually see evidence based explainability as a key feature of my current best formulation of a concrete solution. That, in context presents quite a few nuts to crack.

> Finally, I'm very familiar with meditations on moloch

I only meant to link the fish story but the link in MoM was broken and I failed to find a backup on archive.org, not putting a whole ton of effort into looking.

Consider how the described "games" change if those willing to cooperate and achieve the maximal outcomes could preselect to only play with those who are inclined to act similarly? If you grouped the defectors and cooperators to play within their chosen strategies based on prior action? Iterated games have different solutions and I find those indicative of life, except that social accountability doesn't scale. In real life such specificity is impossible and no guarantees exist. Yet, I believe that the right systemic support structures could solve a number of problems, including a small movement of the needle towards greater game theoretic affinity and thereby a shift in the local maxima to which we have access.

[0] https://en.wikipedia.org/wiki/Nosedive_(Black_Mirror)


Thanks, that was much clearer. Well, there are indeed many options and paths we could take in the space, so good luck with whatever you end up trying. Only one final note: I'm a very secretive person myself, and even beyond that I understand your reticence to share more details about some of your specific ideas... but I think that sharing more openly would align better with that shift in the local maxima you aspire to achieve. For example, I'm sure at least some of us would be interested in reading a submission or blog post about many of these ideas.


The question is too far up in fuzzy space. Narrow down to several use cases and specific problems within those, and the search field will be more manageable. Examples: Social workers want to be able to handle more cases appropriately. How many cases can they handle without diminishing quality scores? Politicians want to appear to care about the needs of as many constituents as possible. How do they group needs into buckets to find what is most relevant? Find the overlap and dig into it with more cases and then questions.


Like automating the analysis of a recorded argument according to Gottman institute and other social heuristics for augmentation of marriage counseling services?

[edit: i.e. count positive and negative sentiment statements assigned to speakers and compare the per speaker ratio to the experimentally determined minimum "healthy" ratios not yet replicated]
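As a toy sketch (classify_sentiment is a stand-in for whatever real model or human coding you'd use; the word list is obviously not a serious classifier):

    from collections import defaultdict

    def classify_sentiment(statement):
        # Placeholder: swap in a real sentiment model here.
        negative_words = {"never", "always", "fault", "ridiculous"}
        return -1 if any(w in statement.lower() for w in negative_words) else 1

    def ratios(transcript):
        # transcript: list of (speaker, statement) pairs from the recording.
        counts = defaultdict(lambda: [0, 0])   # speaker -> [positive, negative]
        for speaker, statement in transcript:
            if classify_sentiment(statement) > 0:
                counts[speaker][0] += 1
            else:
                counts[speaker][1] += 1
        return {s: pos / max(neg, 1) for s, (pos, neg) in counts.items()}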

You're right that there needs to be a tractable starting place. This is not lost on me. I may have used a flexible definition of "close to solving" but one's interpretation also fits into the scope of the effort. I'm at least 10% into it! ;P


A way to preserve and link factual data sets.

Most references on Wikipedia are dead links.

Many legacy media will stealth edit articles or outright delete them.

Original media files can be lost, and after strange eons their authenticity will not be able to be asserted.

It will soon be impossible to distinguish deep fakes from actual, original and genuine media.

Some regimes such as Maoist China wanted to rewrite their past from scratch and erased all historical artifacts from their territory.

There is strong pressure to create an Orwellian Doublespeak to erase certain words entirely from speech, books and records. With e-books now the norm, it has become a legitimate question to ask whether the books are the same as they were when the author published them.

Collaborative content websites have shown that they were not immune to subversive large and organized influence operations.

I have set my mind to multiple solutions (even bought a catchy sounding *.org domain name!). Obviously it will have to be distributed so as to build a consensus, and thus it will have to rely on hashes. But hashes alone are meaningless, so some form of information will have to come along with them, which is itself information to authenticate with other hashes. I was thinking that the authentication value would come from individual recognized signatories. Those would be a mesh of statements of records. For example you might not trust your government, but you might trust your grandparents and your old neighbors who all agree that there was a statue on the corner of the street, and they all link to each other and maybe link to hashes of pictures and 3D scans with links. Future generations can then confirm those links with other functional URIs.

Something like blockchain technology seems an obvious choice, but I have no experience with that (for now). There is also the problem that it needs to be easily usable; therefore there is a need for a bit of centralization (catchy domain name, yay!), although anyone could set up his/her own service for certain specialized subjects.

Thoughts?


One solution might be to collaborate with, build upon, and donate to an existing reputable organization like archive.org or Wikipedia which takes snapshots of websites.

Given these snapshots, you could write manual extractors or build a machine learning system [1] to extract the main content of each page as plaintext. Then load these timestamped text file snapshots into git, which will give you a hash of the content and let you easily track changes.
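As a rough sketch of that snapshot-and-commit step (the file layout and commands are just one way to do it, and the repo directory is assumed to already be a git repo):

    import hashlib
    import subprocess
    from datetime import datetime, timezone
    from pathlib import Path

    def snapshot(url, plaintext, repo="snapshots"):
        # Store the extracted text under a content hash, timestamped, then commit
        # so the git history becomes the tamper-evident record.
        digest = hashlib.sha256(plaintext.encode("utf-8")).hexdigest()
        stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H%M%SZ")
        path = Path(repo) / f"{stamp}_{digest[:16]}.txt"
        path.write_text(f"# {url}\n# sha256:{digest}\n\n{plaintext}", encoding="utf-8")
        subprocess.run(["git", "-C", repo, "add", path.name], check=True)
        subprocess.run(["git", "-C", repo, "commit", "-m", f"snapshot {url} {stamp}"], check=True)
        return digest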

Push the git repo to a few places like github, bitbucket, and maybe IPFS where people can mirror it.

[1] https://joyboseroy.medium.com/an-overview-of-web-page-conten...

Alternatively you could use the built-in reader mode in Firefox or Chrome which do this automatically, but then you'd have to figure out how to maintain stability of the extraction algorithm between new browser releases.


I'm working on a prototype that uses the compositional game theory [1] and adapts it to be able to reliably predict the order complexity of functors and their differences between states.

A huge bonus there would be if the order difference can be represented in a graph, so that tessellation or other approaches like a hypercube representation can be used for quick estimations. (that's what I'm aiming for right now)

If successful, the next step would be to integrate it into my web browser so that I can try out whether the equilibrium works as expected on some niche topics or forums.

[1] https://arxiv.org/abs/1603.04641


Yeah, this week I've restarted my scanning tunneling microscope, which I've been failing to make work for years... The current one is a standard pair of long metal bars with the piezoelectric component on one end, with 2 screws, and a 3rd screw on the other end.

My problem is that no matter how I design the thing, either the screws offer too little precision, so I can't help but crush the tip into the sample every time, or too little travel distance, so I can't help but crush the tip into the sample when adjusting the coarser screws near the tip. This is the kind of thing that looks like a non-problem on the web, because everybody just ignores this step.


It sounds like you need to add a "stage" to help position your sample. Flexures are systems that bend to perform motion, and can do surprising things that you can't do with joined together machined pieces.

Here's an open source stage project using flexures that will likely help

https://openflexure.org/projects/blockstage/

Also, see Dan Gelbart's 18 video series about building prototypes

https://www.youtube.com/watch?v=xMP_AfiNlX4


I'm a fan of this project, but flexures are antithetical to an STM assembly. An STM needs very rigid components; the smallest vibration can interact with the height adjustment and push the tip into the sample.

But it's a great assembly for anything that doesn't have a feedback on the positioning.


Your STM is missing a proper approach mechanism. The vertical range of the piezo that is used for scanning will only be a few hundred nanometers. A screw is too coarse for that! Stick-slip mechanisms with a ramp (Besocke or beetle-type STM, https://www.researchgate.net/figure/1-Diagram-of-the-Besocke... or page 25 of https://www.bgc-jena.mpg.de/bgc-mdi/uploads/Main/MoffatDiplo...) are one solution. Even with such a mechanism, the 'approach' phase takes many minutes!


Adding another step (not a problem, just maybe a solution): a quick estimate says that if I apply a few-kHz signal to the sample, it will induce enough current at the tip to be detected by the preamp once the distance reaches the micrometer range. That's the same range where you want to stop the approach, so it may make a nice proximity signal.
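
For what it's worth, the back-of-envelope version of that estimate, with the geometry numbers below being rough guesses rather than measurements:

    # Treat tip + shank vs. sample as a crude parallel-plate capacitor.
    import math

    eps0 = 8.854e-12          # F/m
    area = (100e-6) ** 2      # assumed ~100 um x 100 um effective facing area
    gap  = 1e-6               # ~1 um tip-sample distance
    f    = 5e3                # a few kHz drive on the sample
    v    = 1.0                # 1 V amplitude

    c = eps0 * area / gap     # ~0.09 pF
    i = 2 * math.pi * f * c * v
    print(f"C ~ {c:.1e} F, I ~ {i:.1e} A")   # on the order of a few nA

A few nA is comfortably above the noise floor of a typical STM preamp, so the idea seems plausible on paper.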

I've got to try this on my next attempt.


Sounds good!

In the future, if you'd like even more precise measurements, theoretically you could use 2 different frequencies or a reflected source, and look at the interference or superposition of the waves.

I'm by no means an expert in this, but I've heard that optical measurement (eg: laser + Michelson interferometer) could theoretically take you down to the nanometer range.

But it's easy to go overboard with this, haha.

https://www.osapublishing.org/oe/fulltext.cfm?uri=oe-20-5-56...

https://iopscience.iop.org/article/10.1088/0957-0233/9/7/004


Wow, this sounds very ambitious!

Perhaps you could somehow attach the piezoelectric component or bars to a micrometer [1] which is designed for accurate and repeatable measurement?

[1] https://en.wikipedia.org/wiki/Micrometer


Yes, I could. I believe the most straightforward design would be 3 micrometers directly supporting the piezoelectric component, with no levers.

Yet, they are a bit expensive. I'm still not willing to budget all that, but I'm starting to consider it.


Digital micrometers are expensive, but you can get analog ones on AliExpress [1] or other places for around $10. Of course, the precision may not be as good as name-brand (eg: Mitutoyo) tools.

[1] https://www.aliexpress.com/wholesale?catId=0&initiative_id=S...


Yes! Thanks a lot.

Looks like the exact things I need are micrometer heads, and some even come with nice threaded mounts.


One more idea: there exist worm-drive micrometers, which allow you to step down the linear movement per revolution even more:

https://www.global-optosigma.com/en_jp/Catalogs/pno/?from=pa...

If you have machining/fabrication skills, it might also be possible to buy a few worm gear sets and modify your micrometer to move really slowly but precisely.


Excellent, glad that you found this to be helpful. Good luck with your project!


We are working on a totally new way to do cold fusion; our only problem is getting enough new fuel into the reactor without disturbing the running process.

Any help would be greatly appreciated.


Could you use tiny pellets fed in by a linear actuator and gravity, a rotary loader, or some kind of conveyor belt? If you want an off-the-shelf solution that's easy to reload, perhaps you could repurpose the loading mechanism of a machine gun to dispense the pellets.


What does your reactor look like? What fuel are you using? How are you getting to ignition?


Disclaimer : I know nothing about anything.

Hawking radiation ?

Laser tunnel ?

Magnetic canon ?

Centrifugal launcher ?

Vacuum diffusion ?

Electrical beam lensing ?


> Hawking radiation ?

That actually does make for a fairly efficient (better than fusion in energy per unit fuel mass) reactor design in principle, but you need a sub-solar-mass black hole (in the 1 billion - 100 billion ton range), and there's no known practical way to produce one.
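
For scale, the Hawking luminosity of an uncharged, non-rotating hole is

    P = \frac{\hbar c^6}{15360 \pi G^2 M^2}
      \approx \frac{3.6 \times 10^{32}\ \mathrm{W\,kg^2}}{M^2},
    \qquad P(M = 10^{12}\ \mathrm{kg}) \approx 3.6 \times 10^{8}\ \mathrm{W},

so a billion-ton hole radiates at roughly power-station levels, and the output is drawn straight from its mass, i.e. essentially E = mc^2 efficiency versus well under 1% for fusion.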


Tesla valve?


Maybe the solution is in your fridge?

... Not being flippant; I find that these kinds of prompts can help you think from a new angle.


A search engine that prioritizes ad-free, tracker-free sites.

Of course Google can't do it. But this is ripe for someone to step in.



Stateful, exactly-once event processing without the operational capacity to run a proper Flink cluster. This thing needs to be dead simple, pragmatic, and cheap/simple to operate and update. The only stateful part in our infra at the moment is a PG database.

We are going to start work on this in a few weeks, so I'm looking for insights/shortcuts/existing projects that will make our lives easier.

The goal is to process events from students during exams (max 2,500 students/exam = ~100k-150k events) and generate notifications for teachers. No fancy ML/AI, just logic. Latency of at most 1 minute.

Our current plan is to let a worker pool lock onto exams (PG lock) and pull new events every few seconds for those exams, where (time > last pull & time < now - 10s). All the notifications that are generated are committed together with a serialized state of the state machine and the ID of the last processed event. Events would just be stored in PG.
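
Roughly, one polling step per locked exam would look like this (psycopg2-style sketch; table/column names and the step_state_machine/serialize helpers are made up, and it cursors on the event id rather than the pull time):

    def process_exam(conn, exam_id, state, last_event_id):
        # One transaction: read new events, run the state machine, and commit
        # notifications + serialized state + cursor together.
        with conn:
            with conn.cursor() as cur:
                cur.execute(
                    """SELECT id, payload FROM events
                        WHERE exam_id = %s AND id > %s
                          AND created_at < now() - interval '10 seconds'
                        ORDER BY id""",
                    (exam_id, last_event_id),
                )
                for event_id, payload in cur.fetchall():
                    notifications, state = step_state_machine(state, payload)
                    for body in notifications:
                        cur.execute(
                            "INSERT INTO notifications (exam_id, body) VALUES (%s, %s)",
                            (exam_id, body),
                        )
                    last_event_id = event_id
                cur.execute(
                    "UPDATE exam_progress SET state = %s, last_event_id = %s WHERE exam_id = %s",
                    (serialize(state), last_event_id, exam_id),
                )
        return state, last_event_id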

This solution is meant to be simple, to be implemented in a really short timeframe, and to serve as a case study for a more "proper", large-scale architecture later on.

Any tips, tricks or past experiences are much appreciated. Also, if you think our current plan sucks, please let me know.


I think you could leverage SKIP LOCKED for this - this blog post https://www.2ndquadrant.com/en/blog/what-is-select-skip-lock... explains it nicely.
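
The claiming query would be something like this (sketch; the table/column names and handle_exam are made up). Note the row lock only lasts until commit, so the worker has to do its processing inside the same transaction or record the claim in a column instead:

    # Each worker grabs one exam; rows already locked by other workers are
    # skipped rather than waited on, so the pool spreads itself out.
    with conn:
        with conn.cursor() as cur:
            cur.execute(
                """SELECT exam_id
                     FROM exam_progress
                    WHERE active
                    LIMIT 1
                    FOR UPDATE SKIP LOCKED"""
            )
            row = cur.fetchone()
            if row is not None:
                handle_exam(cur, row[0])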


I've had good experiences with PQ (https://pypi.org/project/pq/). Any event that generates a notification adds an entry to the queue. Worker processes get entries from the queue. The queue is stored as another table in your database, whose structure and content are managed by PQ, though you can always read/write to it if you want. PQ handles the concurrency.
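
A rough sketch of what that looks like, if I remember the pq API correctly (the queue name, payload, and deliver function are made up; double-check against the project docs):

    # pip install pq psycopg2
    import psycopg2
    from pq import PQ

    conn = psycopg2.connect("dbname=exams")
    pq = PQ(conn)
    pq.create()                      # one-time: creates pq's queue table

    queue = pq["notifications"]
    queue.put({"exam_id": 42, "teacher_id": 7, "msg": "student flagged"})

    task = queue.get()               # None if the queue is currently empty
    if task is not None:
        deliver(task.data)           # your own delivery logic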


(EDIT: just realised that you specifically mentioned stateful event processing, while what I describe below are two approaches for stateless, exactly-once event processing)

Having had a few cracks at this problem, in my opinion using locks is the wrong approach.

What you will want is:

* split all input data in batches (eg batches of 10k records, or periodic heartbeats every X seconds, etc)

* assign each batch a unique identifier

* when writing data to the output store, store the batch id along the data;

* when retransmitting a batch for whatever reason, reuse the same batch id and overwrite any data in the output store that matches this batch id.
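
A minimal sketch of that last point (psycopg2-style; the table and columns are made up):

    def write_batch(conn, batch_id, rows):
        # Idempotent write: re-delivering a batch first wipes whatever that
        # batch wrote before, so retries can't produce duplicates.
        with conn:                                   # one transaction per batch
            with conn.cursor() as cur:
                cur.execute("DELETE FROM output WHERE batch_id = %s", (batch_id,))
                cur.executemany(
                    "INSERT INTO output (batch_id, payload) VALUES (%s, %s)",
                    [(batch_id, row) for row in rows],
                )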

Obviously this becomes more tricky when you’re dealing with eg window functions or more complex aggregations.

In this situation, I believe that an approach such as "asynchronous barrier snapshotting" works best. Every X seconds, you increment an epoch. While incrementing, you stop ingestion. Then you first tell the output store to create a checkpoint, then the input source to create a checkpoint, and once both have been checkpointed, you can continue streaming data again.

Anyway, these are two approaches I’ve used over the years that work well. Explicit locks don’t work well in distributed processing, imho.


Sounds like a change data capture problem. Consider using Debezium; my team was able to use the standalone Java engine to connect to a Postgres DB and stream insert/update/delete events (within the context of the Java app, not an external Kafka stream). You could filter those events and apply your notification and other logic to the filtered events.


PG is great for this. Should handle ~100 or more events per second without much work (but set up a retention policy, and watch out for tables growing to > ~1M rows, as that will kill you during autovacuum).

You can use txid_current_snapshot() and friends to track the last "timestamp". Proper use of locks will help you avoid the complexity associated with long-lived transactions.
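
Concretely, one shape of that pattern (table, column, and variable names are placeholders):

    # Writers tag each event with their transaction id:
    #   CREATE TABLE events (id bigserial PRIMARY KEY, payload jsonb,
    #                        txid bigint NOT NULL DEFAULT txid_current());
    # The poller only reads rows whose txid is below the snapshot's xmin,
    # i.e. from transactions that are guaranteed to be finished, so a slow,
    # long-lived writer can't cause rows to be skipped when it commits late.
    cur.execute("SELECT txid_snapshot_xmin(txid_current_snapshot())")
    horizon = cur.fetchone()[0]
    cur.execute(
        "SELECT id, payload FROM events WHERE txid >= %s AND txid < %s ORDER BY id",
        (last_horizon, horizon),
    )
    new_events = cur.fetchall()
    last_horizon = horizon           # persist this alongside the processing state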

Exactly-once semantics can be tricky to guarantee if you do it at the wrong layer of abstraction. Sometimes building exactly-once semantics on top of at-least-once semantics is the way to go.

Kafka and RabbitMQ are both overkill under 100 events/sec. The extra ops overhead isn't worth it. Besides, with PG it'll be nice to be able to always query a couple of tables to completely discern the state of the system.


A message queue (e.g. RabbitMQ) sounds like a more natural fit for your problem.

What is the peak and avg QPS you need to support? High peak QPS might force you to introduce distributed workers and makes locking impractical.

Another consideration is how much you care about data integrity. Would it be a problem if a few messages are lost? What if a message is processed twice? What if servers lose connection to the DB for a few seconds? What if a whole server/DB goes down?


Does Kafka not get you halfway there? It will guarantee exactly once semantics. Use MSK or Confluent cloud if you can use managed services.

It's more future-proof than building this on top of Postgres.


I'm not sure I'm close to solving it, but I have an approach that I'd like some feedback on.

I have a corpus of text in many Indian languages, which I'd like to index and search. The twist is that I'd like to support searches in English. The problem is that there are many phonetic transliterations of the same word (e.g. the Hindi word for law can be written as either "qanoon" or "kanun"), and traditional spelling-correction methods don't work because of the excessive edit distance.

My approach is this: use some sequence-to-sequence ML technique (LSTM, GRU, ..., attention) to map a query in English to the most probable original-script form, and then use that to look it up with a standard document indexing toolkit like Lucene. (I can put together a training dataset of English transliterations of sentences paired with their original text.)

The problem is that I'd like the corpus, the index, and the model to all be on a mobile device. I have a suspicion that the above method won't straightforwardly fit on a mobile (for a few GB of corpus text), and that the inference time may be long. Is this assumption wrong?

How would you solve the problem? Would TinyML be a better approach for the inferencing part?


I'm not sure I understand the problem specification. You want to be able to search "law", and find documents containing "qanoon" or "kanun", right? How does your proposed solution handle that? It seems like the approach with ML TL -> Lucene would still only find one of the two, unless your model is written to return a set of possible transliterations. Or are you saying your approach doesn't currently solve this part of the problem, and that's one of the things you'd like input on?

Is the corpus the only data you have, i.e. do you need to use it for training and validation as well?

In terms of the size of the data, if you want to store the corpus on the phone anyway, won't the index and model be relatively small in comparison?


No, sorry for the confusion. I want to be able to type “kanun” or “qanoon”, and have it infer the Hindi word “कानून”, which is an indexed word.

It is not necessary that there is a one-to-one correspondence between words. Sometimes two English words may represent one Hindi word, or vice versa.

I believe I can build up a decent sized training/validation set, for example from Bollywood song lyric databases written in English, and mapping them to the Hindi equivalent (or Tamil, Bengali, etc).

As for your last question, I don't know, since I haven't implemented an ML model in practice. I saw a tutorial on BERT this morning, where a word has 768 features. That itself sounds huge, let alone the model itself.


A non-ML way to approach this is to use phonetic distance: e.g. "qanoon" and "kanun" sound the same, so they are close.

There is an algorithm called Soundex with Python implementations you can try.
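
For example, with the jellyfish package (just a sketch; whether the codes actually collide for these transliterations is exactly what you'd need to test):

    # pip install jellyfish
    import jellyfish

    for word in ("qanoon", "kanun"):
        print(word, jellyfish.soundex(word), jellyfish.metaphone(word))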


Refer to what I said about edit distance. Traditional methods don't work well at all.

The fundamental difference between my problem and traditional spelling correction algos is that in the latter, there is a canonical correct spelling to be used as a reference. In my problem, there isn't. There are different approximate ways of spelling out most hindi words ... there is no one correct way. There are common patterns, sure, but it is too tedious to encode all the variations.


Working to enable users of https://www.DreamList.com to record audio of any length and see it transcribed, ideally while the recording is in progress, with the recording itself also saved. The goal is for grandparents to save stories for loved ones without worrying about the quality of the transcription - just talk. Once the recording is saved, the transcription can be redone or tweaked later if needed, but the memory is not lost.

DreamList is web and soon native apps, so WebRTC connected to a cloud transcription service is my first instinct, but there are benefits to the native iOS APIs as well - especially being able to share stories while also listening to other streams on iOS (families talking and digging into stories together). What architecture/transcription approaches would you suggest? Any gotchas you've seen dealing with similar problems (accuracy given accents, whether we should train our own transcription model on gathered data, etc.)?


I worked on this for a couple years during a previous startup attempt.

I designed a custom STT model via Kaldi [0] and hosted it using a modified version of this server [1]. I deployed it to a 4GB EC2 instance with configurable docker layers (one for core utils, one for speech utils, one for the model) so we could spin up as many servers as we needed for each language.

I would recommend the WebRTC or Gstreamer approach, but I wouldn't recommend trying to build your own model. It's really hard. Google's Cloud API [2] works well across lots of accents and the price is honestly about the same as running your own server. If you want to host your own STT (for privacy or whatever), I'd recommend Coqui [3] (from the folks who ran Mozilla's DeepSpeech program). Note that this will likely be much, much worse on accents than Google's model.

[0]: https://kaldi-asr.org/

[1]: https://github.com/alumae/kaldi-gstreamer-server

[2]: https://cloud.google.com/speech-to-text

[3]: https://coqui.ai/code

Edit: Forgot to mention, there's also a YC company called Deepgram that provides ASR/STT as a service, you could give them a shot: https://deepgram.com/
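
If you go the Google route, the streaming call is roughly this shape (Python client; the audio_chunks generator stands in for whatever your WebRTC/Gstreamer pipeline hands you, so treat it as a placeholder):

    # pip install google-cloud-speech
    from google.cloud import speech

    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )
    streaming_config = speech.StreamingRecognitionConfig(
        config=config,
        interim_results=True,        # live "as they talk" partial transcripts
    )

    requests = (
        speech.StreamingRecognizeRequest(audio_content=chunk)
        for chunk in audio_chunks    # placeholder audio source
    )
    for response in client.streaming_recognize(streaming_config, requests):
        for result in response.results:
            print(result.alternatives[0].transcript,
                  "(final)" if result.is_final else "(interim)")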


In my experience, Google's API completely fails when any slightly unusual vocabulary is involved (e.g. in this instance, grandparents talking about their past jobs), and tends to just silently skip over things. Amazon's wasn't much better with vocab, but at least it didn't leave things out, so you could see the problems. I don't have experience with any of the others, but I think for my purposes (subtitles for maths education videos) no one will have made an appropriate model yet.


I too am missing my 1990s forum experience. This feeling, and a particularly frustrating few minutes spent on LinkedIn, prompted me to write something about it.

I discuss some intellectual problems and solutions.

https://blog.eutopian.io/building-a-better-linkedin/

