The most interesting work on multimodal LLMs I have seen is from a team digesting CAD schematics from maritime operators for semantic search integration. The results are very impressive: you can ask for directions between engine room 123 and the exit, how many fire hoses are on the way, etc. The latest thing they are playing with is benchmarking against IKEA flat-pack instructions, which I think is genius.
Check the feed here: https://twitter.com/hrishioa/status/1755626405239636450?t=5O...