Hacker News
Chunking strings in Elixir: how difficult can it be? (ochagavia.nl)
80 points by wofo on Jan 4, 2023 | 21 comments



> It is possible, however, to craft a malicious grapheme cluster spanning kilobytes (see this SO answer). What to do about that? Well, if we assume that all grapheme clusters that are too big are malicious, then there is no harm in splitting them wherever we want. The chunked messages will contain gibberish in the place where the original grapheme cluster was, but at least we can reliably deliver the rest of the message. If you have a better idea, let me know!

IMO, the appropriate response to something like this is to drop the message in its entirety, taking the classic "let it crash" approach of Erlang/Elixir. No need to pass a message known to be malicious along to potentially less fault-tolerant parts of your system.
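A minimal sketch of that approach, assuming we treat any oversized grapheme cluster as hostile (the MessageGuard module and the 128-byte threshold are made up for illustration):

    # Hypothetical guard: reject the whole message instead of chunking
    # through a suspicious grapheme cluster.
    defmodule MessageGuard do
      @max_grapheme_bytes 128

      def validate!(message) do
        if not String.valid?(message) do
          raise ArgumentError, "message is not valid UTF-8"
        end

        Enum.each(String.graphemes(message), fn g ->
          if byte_size(g) > @max_grapheme_bytes do
            raise ArgumentError,
                  "suspiciously large grapheme cluster (#{byte_size(g)} bytes)"
          end
        end)

        message
      end
    end

Raising here lets a supervisor decide what to do with the offending message, rather than forwarding gibberish downstream.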


Agreed. If you discover something is off, just abort early; don't try to save it.

One classic exploit I've seen more than once is when you allow a subset of HTML and remove other tags. A naive approach will, for instance, remove <script>. But what happens if I then send in <scr<script>ipt>? If the filter only does one pass, I've fooled it. Instead, assume bad faith as soon as you encounter input like that and crash, instead of kneading it into something acceptable.
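A tiny illustration of that failure mode (naive_strip is hypothetical; it stands in for any single-pass blacklist filter):

    # Hypothetical single-pass filter: remove the literal "<script>" substring.
    naive_strip = fn html -> String.replace(html, "<script>", "") end

    naive_strip.("<scr<script>ipt>")
    #=> "<script>"  (the removal itself reassembles the forbidden tag)

A second pass would catch it, but as the parent says, the safer move is to treat input like this as hostile and stop.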


This seems way more error prone than just using an HTML parser and removing script tags.

If you use a syntactical parser, you can guarantee your output is safe, all the time. If you're using non-syntactical techniques, then you have to go and find all these edge cases that you should crash on in the first place.


Just for anyone reading this: there is no safe way to de-fang HTML with a blacklist. You can only safely allow certain tags and attributes. Many people have fallen on this sword.
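A minimal whitelist sketch, assuming the HTML has already been parsed into a tree of {tag, attributes, children} tuples (the shape parsers such as Floki produce); the allowed tag and attribute lists are just examples:

    # Allow-list sanitizer over an already-parsed HTML tree.
    # Anything not explicitly allowed is dropped, children included.
    defmodule Whitelist do
      @allowed_tags ~w(p b i em strong a ul ol li)
      @allowed_attrs ~w(href title)

      def sanitize(nodes) when is_list(nodes), do: Enum.flat_map(nodes, &sanitize_node/1)

      defp sanitize_node(text) when is_binary(text), do: [text]

      defp sanitize_node({tag, attrs, children}) do
        if tag in @allowed_tags do
          attrs = Enum.filter(attrs, fn {name, _value} -> name in @allowed_attrs end)
          [{tag, attrs, sanitize(children)}]
        else
          []
        end
      end

      # Comments, doctypes and anything else unexpected are dropped too.
      defp sanitize_node(_other), do: []
    end

Running it on [{"p", [{"class", "x"}], ["hi ", {"script", [], ["alert(1)"]}]}] yields [{"p", [], ["hi "]}]: the unknown attribute and the script subtree are simply gone.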


In a browser context, the built-in DOMParser (perhaps wrapped in dompurify) is the way to do this. No need to look elsewhere.


If you find this topic interesting, you might also like the following talk:

ElixirConf 2019 - High Performance String Processing Scripts in Elixir - Johanna Larsson

https://www.youtube.com/watch?v=Y83p_VsvRFA


For those who won't watch the video: Johanna takes a string-splitting task that runs in 4 seconds in C, 9 in Python, and 17 in Ruby, and starts with a naive Elixir approach that takes 120s. She eventually arrives at a parallel approach using Task.async that trades memory usage for speed and executes in 7s.
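Not the talk's actual code, but a rough sketch of that shape of solution, assuming the task is counting words per line: batch the lines, fan the batches out with Task.async_stream, and combine the results.

    # Parallel line processing: trades memory (batches in flight) for speed.
    defmodule ParallelSplit do
      def count_words(path) do
        path
        |> File.stream!()                    # lazily stream lines
        |> Stream.chunk_every(10_000)        # batch to amortize per-task overhead
        |> Task.async_stream(
          fn lines ->
            Enum.reduce(lines, 0, fn line, acc -> acc + length(String.split(line)) end)
          end,
          max_concurrency: System.schedulers_online(),
          ordered: false
        )
        |> Enum.reduce(0, fn {:ok, count}, acc -> acc + count end)
      end
    end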


> During development, I noticed that the unicode_string library raises an exception when trying to get the next word if there is invalid UTF-8 anywhere in the string. That seemed weird to me, because I would expect the library to only scan the part of the string that is relevant to get the next word. If the library is checking the whole string for UTF-8 validity every time you want to get the length of the next word, that will lead to non-linear algorithmic complexity.

I wonder: could the algorithm implemented in unicode_string be improved? Or were the algorithms implemented as Rust NIFs also prone to this non-linear complexity?


I think the answer is yes. See my comment here: https://news.ycombinator.com/item?id=34253876

They are repeatedly calling Regex.split on the trailing part of the string as they move across it. Regex.split finds all the matches in the string, so each call is linear, and calling it repeatedly gives you a quadratic algorithm. The Elixir/Erlang regex implementation does not check whether the string is valid UTF-8 before running the regex; they were getting the error because the whole string was being evaluated by the regex engine. An improved version probably won't be as fast as the Rust solution, but it won't be horribly slow on large strings.
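A toy version of that pattern (not unicode_string's actual code): every iteration re-scans the whole remaining string, so the total work is roughly n + (n-1) + ... + 1 character scans.

    # Quadratic splitter: each call to Regex.split scans the entire
    # remainder, even though we only consume one token per iteration.
    defmodule QuadraticSplit do
      def tokens(string, regex \\ ~r/\s+/u, acc \\ [])

      def tokens("", _regex, acc), do: Enum.reverse(acc)

      def tokens(string, regex, acc) do
        case Regex.split(regex, string, parts: 2) do
          [token, rest] -> tokens(rest, regex, [token | acc])
          [token] -> Enum.reverse([token | acc])
        end
      end
    end

Scanning once up front (e.g. Regex.scan with return: :index) and slicing at the returned offsets keeps the whole thing linear.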


I don't know exactly how the word-breaking algorithm works under the hood, but the Rust library I am using seems to have linear complexity, so it should be possible to improve unicode_string (if someone is willing to dedicate the time).


As the author of bstr (and of the regex implementation that bstr uses to implement word breaking): yes, it is linear time. It deals with invalid UTF-8 as it sees it. When invalid UTF-8 is encountered, it is treated as if it were the replacement codepoint via the "substitution of maximal subparts" strategy. See: https://docs.rs/bstr/latest/bstr/#handling-of-invalid-utf-8

NSFL regex that implements word breaking: https://github.com/BurntSushi/bstr/blob/86947727666d7b21c97e...


Thanks for clarifying. And thanks for the article, it was a great read!


It's an interesting discussion, but do you need to chunk strings into displayable substrings for transmission, or is it sufficient to assume all-or-nothing (or at least in-order) reassembly of the string at the other end? Choosing to treat it as chunked binary can eliminate a lot of complexity.


I'm assuming the use case is to send messages back to the user on a platform like WhatsApp after the CSR writes a few paragraphs in their help desk software. This seems like the final step before making the API calls to actually send those messages to the user.

I agree that, in a scenario where you control both ends of the transmission, boring binary chunks would be easier and simpler.


I agree with both of you :). Sending an opaque binary blob and reassembling it at the other side would have been ideal... but with some of the third-party systems we integrate with, it is not possible.


This could be sped up significantly by implementing a fast path for ASCII strings. Most inputs would be ASCII (even accounting for the amount of emoji people use these days). You'd want to check with real-world data whether it's worth it, though.
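A sketch of what such a fast path could look like (the module name and the slow-path stub are hypothetical): if every byte is below 128, each byte is its own grapheme, so as far as grapheme safety goes the binary can be sliced at any byte boundary, skipping the Unicode segmentation entirely.

    defmodule AsciiFastPath do
      # True if every byte is < 128, i.e. the binary is pure ASCII.
      def ascii?(<<>>), do: true
      def ascii?(<<byte, rest::binary>>) when byte < 128, do: ascii?(rest)
      def ascii?(_), do: false

      def chunk(message, max_bytes) do
        if ascii?(message) do
          chunk_bytes(message, max_bytes)          # fast path: plain byte slicing
        else
          grapheme_aware_chunk(message, max_bytes) # slow path, elided here
        end
      end

      # The last chunk may be shorter than max_bytes.
      defp chunk_bytes(bin, size) when byte_size(bin) <= size, do: [bin]

      defp chunk_bytes(bin, size) do
        <<head::binary-size(size), rest::binary>> = bin
        [head | chunk_bytes(rest, size)]
      end

      # Stand-in for the grapheme/word-aware chunking the article implements.
      defp grapheme_aware_chunk(message, _max_bytes), do: [message]
    end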

To conclude that the performance data shows exponential complexity is erroneous: the data sent to Wolfram Alpha doubles with each data point because the input size is doubling, but the x-axis info isn't being supplied to Alpha. Eyeballing the data, the last two data points are 5.85 and 11.8, which is very close to double (11.7 vs 11.8), suggesting linear complexity.


Just to clarify, 5.85 ms and 11.8 ms are the data points of the baseline, so complexity is indeed linear there. The Elixir implementation, though, runs in 225.4 ms (39x slower) and 664 ms (56x slower).

Thanks for pointing out the mistake in the data sent to Wolfram Alpha! Still, if you look at the results table, you can see that the factor of increase is ~2x at the beginning, but is ~3x at the end.

(This is why I wanted a plot instead of a table in the blog post :p, because then the curve speaks for itself)


Ok, thanks for that. I misread the baseline as being the Elixir version. Isn't "baseline" usually the reference implementation that you start with, i.e. the slowest?

Interesting topic, thanks for writing the blog post!


I'm surprised at how expensive this is. The type of development I do doesn't have string processing in the hot path, so I've never benchmarked this sort of Unicode operation, and I don't doubt that the bstr Rust library is reasonably well optimized.

But 1 µs/byte checking for word boundaries: at a few billion instructions per second, that's on the order of 10k instructions per byte! I knew that Unicode was more complex than ASCII, but yikes!


If you have a look at how segmentation is implemented in unicode_string (https://github.com/elixir-unicode/unicode_string/blob/master...), it makes a lot of calls to Regex.split as it moves along the string, repeatedly calling it on the remaining part of the string it still needs to work on. The problem is that Regex.split, as implemented in Elixir, runs the regex on the whole string looking for all the matches even if you include a limit on the number of parts, so the segmentation algorithm is going to have quadratic complexity. Also, when there are Unicode ranges in the regular expression it runs a lot slower. Note that the Erlang re engine does not validate whether the binary is valid UTF-8 before running the regular expression matching; the errors they were getting were because Regex.split() causes the Erlang re engine to match against the whole string.

For example:

    iex(86)> v = "£AXAX" <> String.duplicate("A", 10_000) <> "\xFF"; {:ok, r} = :re.compile("[£-¨]"); :timer.tc(fn() -> :re.run(v, r) end)
    {65, {:match, [{0, 1}]}}
Also, even though Regex.split() returns two strings that together are the size of the input, the function does not necessarily have to do linear work to produce them, because Erlang has fast (reference-based) substring slicing. Example:

    iex(47)> x = String.duplicate("A", 100_000); y = :binary.part(x, {0, 5000}); :binary.referenced_byte_size(y)
    100000
Though it's probably not safe to rely on this, because :binary.part() will not always use a fast slice if the slice is not big enough. I know some Erlang parsing code explicitly keeps track of indices instead of using the <<"foo", rest::binary()>> idiom, presumably because it's faster.
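For illustration (hypothetical code, not from any particular library), the two styles look roughly like this: peeling bytes off the front of the binary versus walking an index over the unchanged original.

    defmodule ScanStyles do
      # Style 1: match prefixes off the binary, recursing on the rest.
      def count_spaces(<<" ", rest::binary>>, acc), do: count_spaces(rest, acc + 1)
      def count_spaces(<<_, rest::binary>>, acc), do: count_spaces(rest, acc)
      def count_spaces(<<>>, acc), do: acc

      # Style 2: keep an explicit index into the original binary.
      def count_spaces_indexed(bin), do: count_spaces_indexed(bin, 0, 0)

      defp count_spaces_indexed(bin, i, acc) when i >= byte_size(bin), do: acc

      defp count_spaces_indexed(bin, i, acc) do
        acc = if :binary.at(bin, i) == ?\s, do: acc + 1, else: acc
        count_spaces_indexed(bin, i + 1, acc)
      end
    end

Both return the same count; which one is faster depends on whether the compiler's match-context optimization applies to the recursive binary matching.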

Here are some timings from Regex.split() that show that it is linear in the size of the input. You can also see that there are some very big constant costs which, when not using Unicode ranges, hide the linear growth at small sizes.

    iex(25)> v = "AXAX" <> String.duplicate("A", 1); r = ~r/[£-¨]/; :timer.tc(fn() -> Regex.split(r, v, parts: 2) end)
    {31, ["AXAXA"]}

    iex(26)> v = "AXAX" <> String.duplicate("A", 10_000); r = ~r/[£-¨]/; :timer.tc(fn() -> Regex.split(r, v, parts: 2) end)
    {232, ...}


    iex(37)> v = "AXAX" <> String.duplicate("A", 1); r = ~r/X/; :timer.tc(fn() -> Regex.split(r, v, parts: 2) end)
    {39, ["A", "AXA"]}

    iex(29)> v = "AXAX" <> String.duplicate("A", 10_000); r = ~r/X/; :timer.tc(fn() -> Regex.split(r, v, parts: 2) end)
    {38, ...}

    iex(38)> v = "AXAX" <> String.duplicate("A", 100_000); r = ~r/X/; :timer.tc(fn() -> Regex.split(r, v, parts: 2) end)
    {274, ..}


Too difficult, obviously, as the author rewrote the core function in Rust... m(



