Hacker News new | comments | ask | show | jobs | submit login
Show HN: SuperString – A fast and memory-optimized string library for C++ (github.com)
52 points by btwael 9 months ago | hide | past | web | favorite | 28 comments

To the author - I understand that you're excited about this. That's awesome, and I'm sorry if I poked too much fun at your code in some other comments. Please take this as a learning experience and please keep trying to create new things - but please also label your code as not production ready when it isn't production ready.

Some high level things you may want to learn: the "mutable" keyword (to stop having to cast "this", "const_cast" (when you actually do need to cast "this"), "enum" (so that you don't have to use "Bool" in the way you do), smart pointers (reference counting is hard, you should let the compiler take care of it for you in this case), "delete []" (make sure to always pair it with "new []").

I appreciate the effort, but as a C++ guy, code like this scares the hell out of me. Just a quick glance and I already have the impression that the code is full of memory leaks, UB, and bad code reuse. Plus, it has really bizarre stuff like creating its own bool type and no use of the standard library besides iostreams (lol iostreams).

I also do not believe the benchmarks to be meaningful.

This. The UTF-8 decoding is broken, and I didn't look at anything beyond that, resulting in a heap buffer overflow with untrusted input.

And here is a summary of the warnings I get when compiling with "-Weverything":

   26 [-Wcast-align]
   20 [-Wcast-qual]
    6 [-Wconditional-uninitialized]
   24 [-Wdollar-in-identifier-extension]
  172 [-Wold-style-cast]
    2 [-Wshorten-64-to-32]
    3 [-Wsign-compare]
   25 [-Wsign-conversion]
    3 [-Wunused-parameter]
   72 [-Wzero-as-null-pointer-constant]
Please don't use this code.

Actually, the UTF-8 decoding is broken even with proper input. This https://github.com/btwael/SuperString/blob/master/src/SuperS... will make `pointer` and `i` be out of sync, so everything after the first multibyte sequence is a coin toss.

To the OP: you should use theses tests to validate your implementation https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt (caveat: 5- and 6-byte sequences are no longer valid UTF-8 and should be rejected).

Well, unless you have a Mac with the user "btwael", the benchmark might prove tough to use as it's hardcoded to "/Users/btwael/Downloads/longtextfile.txt". Sadly, this file isn't included in the repo.

With 2280 lines, a unit test or two would make me way less apprehensive.

For the benefit of those not versed in C++ - this is probably a joke library :) - like the Sokal hoax where he submitted a paper of total gibberish to a humanist journal to prove that the medium is ready to accept any vacuous claims as long as they fall into a particular mould[0].

If this is not obvious to you, start to think about the numbers in the benchmark, and remember it's really friggin easy to do silly errors in C++ with massive consequences on runtime and resource consumption. So, in real life, if you do something utilizing the basic library and the numbers are way off, debug, re-read the code, benchmark. Step through the code in the debugger. Litter it with printfs and so on. If a novice, read through stackoverflow and other resources, and go through the manual pages of the suggested implementation, and understand what is going on under the hood (try to translate the solution into C in your head).

Anything but careening away to create your own custom library.

Comments in this thread provided several explicit things wrong with this already.

[0] https://en.wikipedia.org/wiki/Sokal_affair

Side note: FB also developed a string library to replace std::string (https://github.com/facebook/folly/blob/master/folly/FBString...).

Here's a video from explaining the motivation behind the project (https://www.youtube.com/watch?v=kPR8h4-qZdk).

There are lots of great libraries out there that don't get much attention, it's a shame that something like this, which is total nonsense, ends up getting attention.

What is the reason for having your own Bool type in C++?

And why the redefinition of NULL? The author of this library is not experienced with C++.

Nothing wrong with writing something and sharing it in order to learn, though! Just surprised to see it on the front page.

Just in case I guess. :)

  +// in case NULL is not defined
  +#ifndef NULL
  +#define NULL (0L)
  +#endif // NULL

   return this->_lengthComputed == ((Bool) 2);
Obviously so that you can add a third value :) :)

Looking at the commit history on the project [0] (and inadvertently digging a bit deeper into the OP's StackOverflow reply history to notice that all are advertising his non-production ready projects) certainly gives a similar vibe. :)

[0] https://github.com/btwael/SuperString/commits/master

Oh man...

How is the std string version taking over 450 MB for two copies (original and split into lines) of a 210 kB file?

Probably because his benchmark is broken. It always copies to the end of the file.

https://github.com/btwael/SuperString/blob/master/test/withS... this line should be ` lines.push_back(string.substr(last, i - last));` (it also drops the very last line, but who's counting, eh?).

If the 210kb file has a line every 80 bytes or so and is copied this way, it works out to around 490MB of data (caveat: my math may be wrong).

In that case it's closer to 50 bytes I think. (Average wrong line length is 105k, as first line is 210k and last line is 0k, so number of lines is 450M / 105k (about 4300), so lines would be 210 / 4.3 = about 50 bytes.)

  for(Size i = 0; i < this->_memoryLength; i++) {
        *(this->_data + i) = *(sequence->_bytes + i);
I'd rather use std::memcpy.

or even `std::copy_n(sequence->_bytes, _memoryLength, _data);`. It will produce exactly the same ASM but is more legible.

Can anyone comment if reimplementing it in C# would bring performance benefits over system.string?

Based on a very quick look at the source code here and the source code of System.String [1], it looks like the .NET implementation should be faster and more comprehensive. It would be interesting to see a deeper analysis and comparison of these two.

[1]: https://github.com/Microsoft/referencesource/blob/master/msc...

.NET already has a great string implementation in both the Framework and Core runtimes. If you're talking about manipulating strings in C# itself then the upcoming Span<T> and Memory<T> types will help greatly. Some of the new sockets implementations are already using these new interfaces.

As far as I remember, Federico Lois in his Patterns for High Performance C#, described that in RavenDB they had to roll out they own string type for performance reasons. So I suppose there is still room for improvement in .NET when it comes to string.

edit: link to the talk: https://www.youtube.com/watch?v=7GTpwgsmHgU

Looks cool

(*remarkable, not remarquable by the way)

What's wrong with remarquable?

I’m guessing because it is a French word, and thus stands out as an incorrect choice in the context of an English-language readme—particularly since it isn’t set off in italics to signify an intentionally chosen French word.

Nothing except it's a French word.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact