
Show HN: SuperString – A fast and memory-optimized string library for C++ - btwael
https://github.com/btwael/SuperString
======
a_t48
To the author - I understand that you're excited about this. That's awesome,
and I'm sorry if I poked too much fun at your code in some other comments.
Please take this as a learning experience and please keep trying to create new
things - but please also label your code as not production ready when it isn't
production ready.

Some high level things you may want to learn: the "mutable" keyword (to stop
having to cast "this", "const_cast" (when you actually do need to cast
"this"), "enum" (so that you don't have to use "Bool" in the way you do),
smart pointers (reference counting is hard, you should let the compiler take
care of it for you in this case), "delete []" (make sure to always pair it
with "new []").

------
pubby
I appreciate the effort, but as a C++ guy, code like this scares the hell out
of me. Just a quick glance and I already have the impression that the code is
full of memory leaks, UB, and bad code reuse. Plus, it has really bizarre
stuff like creating its own bool type and no use of the standard library
besides iostreams (lol iostreams).

I also do not believe the benchmarks to be meaningful.

~~~
msim
This. The UTF-8 decoding is broken, and I didn't look at anything beyond that,
resulting in a heap buffer overflow with untrusted input.

And here is a summary of the warnings I get when compiling with
"-Weverything":

    
    
       26 [-Wcast-align]
       20 [-Wcast-qual]
        6 [-Wconditional-uninitialized]
       24 [-Wdollar-in-identifier-extension]
      172 [-Wold-style-cast]
        2 [-Wshorten-64-to-32]
        3 [-Wsign-compare]
       25 [-Wsign-conversion]
        3 [-Wunused-parameter]
       72 [-Wzero-as-null-pointer-constant]
    

Please don't use this code.

~~~
bananapeel
Actually, the UTF-8 decoding is broken _even with proper input_. This
[https://github.com/btwael/SuperString/blob/master/src/SuperS...](https://github.com/btwael/SuperString/blob/master/src/SuperString.cc#L2015)
will make `pointer` and `i` be out of sync, so everything after the first
multibyte sequence is a coin toss.

To the OP: you should use theses tests to validate your implementation
[https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt](https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt)
(caveat: 5- and 6-byte sequences are no longer valid UTF-8 and should be
rejected).

------
fsloth
For the benefit of those not versed in C++ - this is probably a joke library
:) - like the Sokal hoax where he submitted a paper of total gibberish to a
humanist journal to prove that the medium is ready to accept any vacuous
claims as long as they fall into a particular mould[0].

If this is not obvious to you, start to think about the numbers in the
benchmark, and remember it's really friggin easy to do silly errors in C++
with massive consequences on runtime and resource consumption. So, in real
life, if you do something utilizing the basic library and the numbers are way
off, debug, re-read the code, benchmark. Step through the code in the
debugger. Litter it with printfs and so on. If a novice, read through
stackoverflow and other resources, and go through the manual pages of the
suggested implementation, and understand what is going on under the hood (try
to translate the solution into C in your head).

Anything but careening away to create your own custom library.

Comments in this thread provided several explicit things wrong with this
already.

[0]
[https://en.wikipedia.org/wiki/Sokal_affair](https://en.wikipedia.org/wiki/Sokal_affair)

~~~
btwael
[http://wael.boutglay.com/2018/04/13/the-story-behind-
superst...](http://wael.boutglay.com/2018/04/13/the-story-behind-superstring/)

------
3uclid
Side note: FB also developed a string library to replace std::string
([https://github.com/facebook/folly/blob/master/folly/FBString...](https://github.com/facebook/folly/blob/master/folly/FBString.h)).

Here's a video from explaining the motivation behind the project
([https://www.youtube.com/watch?v=kPR8h4-qZdk](https://www.youtube.com/watch?v=kPR8h4-qZdk)).

------
CyberDildonics
There are lots of great libraries out there that don't get much attention,
it's a shame that something like this, which is total nonsense, ends up
getting attention.

------
aivarsk
What is the reason for having your own Bool type in C++?

~~~
a_t48

       return this->_lengthComputed == ((Bool) 2);
    

Obviously so that you can add a third value :) :)

~~~
therein
Looking at the commit history on the project [0] (and inadvertently digging a
bit deeper into the OP's StackOverflow reply history to notice that all are
advertising his non-production ready projects) certainly gives a similar vibe.
:)

[0]
[https://github.com/btwael/SuperString/commits/master](https://github.com/btwael/SuperString/commits/master)

------
tspiteri
How is the std string version taking over 450 MB for two copies (original and
split into lines) of a 210 kB file?

~~~
a_t48
Probably because his benchmark is broken. It always copies to the end of the
file.

[https://github.com/btwael/SuperString/blob/master/test/withS...](https://github.com/btwael/SuperString/blob/master/test/withStd.cc#L26)
this line should be ` lines.push_back(string.substr(last, i - last));` (it
also drops the very last line, but who's counting, eh?).

If the 210kb file has a line every 80 bytes or so and is copied this way, it
works out to around 490MB of data (caveat: my math may be wrong).

~~~
tspiteri
In that case it's closer to 50 bytes I think. (Average wrong line length is
105k, as first line is 210k and last line is 0k, so number of lines is 450M /
105k (about 4300), so lines would be 210 / 4.3 = about 50 bytes.)

------
HelloNurse

      for(Size i = 0; i < this->_memoryLength; i++) {
            *(this->_data + i) = *(sequence->_bytes + i);
      }
    

I'd rather use std::memcpy.

~~~
jcelerier
or even `std::copy_n(sequence->_bytes, _memoryLength, _data);`. It will
produce exactly the same ASM but is more legible.

------
polskibus
Can anyone comment if reimplementing it in C# would bring performance benefits
over system.string?

~~~
manigandham
.NET already has a great string implementation in both the Framework and Core
runtimes. If you're talking about manipulating strings in C# itself then the
upcoming Span<T> and Memory<T> types will help greatly. Some of the new
sockets implementations are already using these new interfaces.

~~~
polskibus
As far as I remember, Federico Lois in his Patterns for High Performance C#,
described that in RavenDB they had to roll out they own string type for
performance reasons. So I suppose there is still room for improvement in .NET
when it comes to string.

edit: link to the talk:
[https://www.youtube.com/watch?v=7GTpwgsmHgU](https://www.youtube.com/watch?v=7GTpwgsmHgU)

------
asdsa5325
Looks cool

(*remarkable, not remarquable by the way)

~~~
thesquib
What's wrong with remarquable?

~~~
bobwaycott
I’m guessing because it is a French word, and thus stands out as an incorrect
choice in the context of an English-language readme—particularly since it
isn’t set off in italics to signify an intentionally chosen French word.

