The first issue I can see with that code is it's not doing what he expects. He does this to read the file into a StringBuffer:
bf.lines().forEach(s -> sb.append(s));
However, this ends up reading all the lines into one giant line, since the String's that lines() produces have the newline character stripped. This leads to the second lines() call to read a 23MB line (the file produced by gen.py). This is less than optimal.
Not the prettiest thing, but it went from 0.594 GB/s to 1.047 GB/s. Also, it doesn't quite do the same as the lines() method, but that's easily changed.
Your version still has the problem that since Java 7 the substring method allocates a new String, which in unnecessary when it is only borrowed to the parseLine function and not needed afterwards.
What you probably want is a view that reuses the same underlying data. With the following I get around 2 GB/s:
public void readString(String data) throws IOException {
int lastIdx = 0;
for (int idx = data.indexOf('\n'); idx > -1; idx = data.indexOf('\n', lastIdx)) {
parseLine(subSequenceView(data, lastIdx, idx));
lastIdx = idx + 1;
}
parseLine(subSequenceView(data, lastIdx, data.length()));
}
CharSequence subSequenceView(CharSequence base, int beginIndex, int endIndex) {
return new StringView(base, beginIndex, endIndex - beginIndex);
}
static class StringView implements CharSequence {
final CharSequence base;
final int offset;
final int length;
StringView(CharSequence base, int offset, int length) {
if (length < 0) throw new IllegalArgumentException("length must be >= 0");
this.base = base;
this.offset = offset;
this.length = length;
}
@Override
public char charAt(int n) {
if (n < 0 || n >= length)
throw new IndexOutOfBoundsException(n);
return base.charAt(offset + n);
}
@Override
public int length() {
return length;
}
@Override
public CharSequence subSequence(int beginIndex, int endIndex) {
return new StringView(base, offset + beginIndex, endIndex - beginIndex);
}
}
He's using it to read the entire files contents into memory, and re-running the lines() method from the now in-memory buffer, presumably to remove any disk I/O from bottle necking the lines() function.
Not the best method to use, probably better to use the Files.readString method.
I'm always interested in thoughts about language design and programming in general. We have fifty+ years of commercial software development and it still sucks. This caught my eye as the kind of error that shouldn't even be allowed to happen.
There are several simple data transformation steps. There's nothing inherently wrong with these steps. The compiler can't read your mind and guess what you wanted to do.
How would this be possible to be prevented with any programming language, existing or hypothetical?
Even if the meta information that the appended string doesn't contain any newlines would be passed on, the second call to lines() would rather use is as an optimization hint instead of raising an error.
To give an example: if the lines() method returned instances of a Line class, the string buffer and it's append() method would be aware and still create a collection of lines instead of one long string. Not necessarily nice and would probably create issues in many other interfaces, but problems like this _can_ be solved with careful language and API design. Note that his end goal was to 'read lines' and not use any particular string abstractions. Nothing to get worked up about here.
That would be a very clunky approach to put it mildly - the explicit design intent of BufferedReader is to let you read lines without caring what the line separator is - it's a lossy transformation. You also can't subclass String. You'd be adding something very unwieldy for an edge case that is directly contrary to the purpose of the API. The 'mistake' (such as it is) of the author here is simply using the wrong API - there are shorter, more direct ways to read a file into a string without any of these contortions.
data = new String(Files.readAllBytes(Paths.get(filename)));
That still comes down to an understanding of the API.
I don't think that appending a SingleLineString to a String object would obviously add a newline. The String/StringBuffer/StringBuilder classes in Java certainly don't work that way.
If you were adding to some other object like an instance of some text representation class, this could be different. But it would still not be 100% safe to deduce just by the API itself without looking at the documentation.
The fastest version I managed to write was:
Not the prettiest thing, but it went from 0.594 GB/s to 1.047 GB/s. Also, it doesn't quite do the same as the lines() method, but that's easily changed.