The first issue I can see with that code is it's not doing what he expects. He d...

peq · on July 27, 2019

Your version still has the problem that since Java 7 the substring method allocates a new String, which in unnecessary when it is only borrowed to the parseLine function and not needed afterwards. What you probably want is a view that reuses the same underlying data. With the following I get around 2 GB/s:

    public void readString(String data) throws IOException {
        int lastIdx = 0;
        for (int idx = data.indexOf('\n'); idx > -1; idx = data.indexOf('\n', lastIdx)) {
            parseLine(subSequenceView(data, lastIdx, idx));
            lastIdx = idx + 1;
        }
        parseLine(subSequenceView(data, lastIdx, data.length()));
    }

    CharSequence subSequenceView(CharSequence base, int beginIndex, int endIndex) {
        return new StringView(base, beginIndex, endIndex - beginIndex);
    }

    static class StringView implements CharSequence {
        final CharSequence base;
        final int offset;
        final int length;

        StringView(CharSequence base, int offset, int length) {
            if (length < 0) throw new IllegalArgumentException("length must be >= 0");
            this.base = base;
            this.offset = offset;
            this.length = length;
        }

        @Override
        public char charAt(int n) {
            if (n < 0 || n >= length)
                throw new IndexOutOfBoundsException(n);
            return base.charAt(offset + n);
        }

        @Override
        public int length() {
            return length;
        }

        @Override
        public CharSequence subSequence(int beginIndex, int endIndex) {
            return new StringView(base, offset + beginIndex, endIndex - beginIndex);
        }
    }

hinkley · on July 27, 2019

Is it possible he edited the code? I don’t see the append in the blog post.

soup10 · on July 27, 2019

this still does a fair amount of processing, curious what the throughput is with just straight reads into a char[] from FileReader

paulddraper · on July 27, 2019

I'm confused.

Where is the second lines call?

ricardobeat · on July 27, 2019

Lines 19 and 27 in the source: https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...

paulddraper · on July 27, 2019

What the heck?

You're right, the hidden scanFile()...makes no sense.

It's stripping all the new lines from the input.

Why? What is going on?

elFarto · on July 27, 2019

He's using it to read the entire files contents into memory, and re-running the lines() method from the now in-memory buffer, presumably to remove any disk I/O from bottle necking the lines() function.

Not the best method to use, probably better to use the Files.readString method.

ricardobeat · on July 27, 2019

That's a pretty big error if you're correct. What does it say about the language when a CS professor falls for this on a 40 line file?

bhaak · on July 27, 2019

Getting benchmarks right is difficult, even for a CS professor. The language doesn't even enter there.

What does it say about you that you're asking a leading question in this way?

ricardobeat · on July 27, 2019

I'm always interested in thoughts about language design and programming in general. We have fifty+ years of commercial software development and it still sucks. This caught my eye as the kind of error that shouldn't even be allowed to happen.

bhaak · on July 27, 2019

There are several simple data transformation steps. There's nothing inherently wrong with these steps. The compiler can't read your mind and guess what you wanted to do.

How would this be possible to be prevented with any programming language, existing or hypothetical?

Even if the meta information that the appended string doesn't contain any newlines would be passed on, the second call to lines() would rather use is as an optimization hint instead of raising an error.

ricardobeat · on July 27, 2019

To give an example: if the lines() method returned instances of a Line class, the string buffer and it's append() method would be aware and still create a collection of lines instead of one long string. Not necessarily nice and would probably create issues in many other interfaces, but problems like this _can_ be solved with careful language and API design. Note that his end goal was to 'read lines' and not use any particular string abstractions. Nothing to get worked up about here.

pvg · on July 28, 2019

That would be a very clunky approach to put it mildly - the explicit design intent of BufferedReader is to let you read lines without caring what the line separator is - it's a lossy transformation. You also can't subclass String. You'd be adding something very unwieldy for an edge case that is directly contrary to the purpose of the API. The 'mistake' (such as it is) of the author here is simply using the wrong API - there are shorter, more direct ways to read a file into a string without any of these contortions.

    data = new String(Files.readAllBytes(Paths.get(filename)));

bhaak · on July 28, 2019

That still comes down to an understanding of the API.

I don't think that appending a SingleLineString to a String object would obviously add a newline. The String/StringBuffer/StringBuilder classes in Java certainly don't work that way.

If you were adding to some other object like an instance of some text representation class, this could be different. But it would still not be 100% safe to deduce just by the API itself without looking at the documentation.

t-writescode · on July 27, 2019

About as much as binary search being wrong in many languages for 20 years: https://thebittheories.com/the-curious-case-of-binary-search...

Nothing. Mistakes happen.

paulddraper · on July 27, 2019

Pray tell...what has Java done poorly to ensure this mistake happened?