Hacker News new | past | comments | ask | show | jobs | submit login

The first issue I can see with that code is it's not doing what he expects. He does this to read the file into a StringBuffer:

  bf.lines().forEach(s -> sb.append(s));
However, this ends up reading all the lines into one giant line, since the String's that lines() produces have the newline character stripped. This leads to the second lines() call to read a 23MB line (the file produced by gen.py). This is less than optimal.

The fastest version I managed to write was:

    public void readString5(String data) throws IOException {
         int lastIdx = 0;
         for (int idx = data.indexOf('\n'); idx > -1; idx = data.indexOf('\n', lastIdx))
         {
            parseLine(data.substring(lastIdx, idx));
            lastIdx = idx+1;
         }
         parseLine(data.substring(lastIdx));
    }
Not the prettiest thing, but it went from 0.594 GB/s to 1.047 GB/s. Also, it doesn't quite do the same as the lines() method, but that's easily changed.



Your version still has the problem that since Java 7 the substring method allocates a new String, which in unnecessary when it is only borrowed to the parseLine function and not needed afterwards. What you probably want is a view that reuses the same underlying data. With the following I get around 2 GB/s:

    public void readString(String data) throws IOException {
        int lastIdx = 0;
        for (int idx = data.indexOf('\n'); idx > -1; idx = data.indexOf('\n', lastIdx)) {
            parseLine(subSequenceView(data, lastIdx, idx));
            lastIdx = idx + 1;
        }
        parseLine(subSequenceView(data, lastIdx, data.length()));
    }

    CharSequence subSequenceView(CharSequence base, int beginIndex, int endIndex) {
        return new StringView(base, beginIndex, endIndex - beginIndex);
    }

    static class StringView implements CharSequence {
        final CharSequence base;
        final int offset;
        final int length;

        StringView(CharSequence base, int offset, int length) {
            if (length < 0) throw new IllegalArgumentException("length must be >= 0");
            this.base = base;
            this.offset = offset;
            this.length = length;
        }

        @Override
        public char charAt(int n) {
            if (n < 0 || n >= length)
                throw new IndexOutOfBoundsException(n);
            return base.charAt(offset + n);
        }

        @Override
        public int length() {
            return length;
        }

        @Override
        public CharSequence subSequence(int beginIndex, int endIndex) {
            return new StringView(base, offset + beginIndex, endIndex - beginIndex);
        }
    }


Is it possible he edited the code? I don’t see the append in the blog post.


this still does a fair amount of processing, curious what the throughput is with just straight reads into a char[] from FileReader


I'm confused.

Where is the second lines call?



What the heck?

You're right, the hidden scanFile()...makes no sense.

It's stripping all the new lines from the input.

Why? What is going on?


He's using it to read the entire files contents into memory, and re-running the lines() method from the now in-memory buffer, presumably to remove any disk I/O from bottle necking the lines() function.

Not the best method to use, probably better to use the Files.readString method.


That's a pretty big error if you're correct. What does it say about the language when a CS professor falls for this on a 40 line file?


Getting benchmarks right is difficult, even for a CS professor. The language doesn't even enter there.

What does it say about you that you're asking a leading question in this way?


I'm always interested in thoughts about language design and programming in general. We have fifty+ years of commercial software development and it still sucks. This caught my eye as the kind of error that shouldn't even be allowed to happen.


There are several simple data transformation steps. There's nothing inherently wrong with these steps. The compiler can't read your mind and guess what you wanted to do.

How would this be possible to be prevented with any programming language, existing or hypothetical?

Even if the meta information that the appended string doesn't contain any newlines would be passed on, the second call to lines() would rather use is as an optimization hint instead of raising an error.


To give an example: if the lines() method returned instances of a Line class, the string buffer and it's append() method would be aware and still create a collection of lines instead of one long string. Not necessarily nice and would probably create issues in many other interfaces, but problems like this _can_ be solved with careful language and API design. Note that his end goal was to 'read lines' and not use any particular string abstractions. Nothing to get worked up about here.


That would be a very clunky approach to put it mildly - the explicit design intent of BufferedReader is to let you read lines without caring what the line separator is - it's a lossy transformation. You also can't subclass String. You'd be adding something very unwieldy for an edge case that is directly contrary to the purpose of the API. The 'mistake' (such as it is) of the author here is simply using the wrong API - there are shorter, more direct ways to read a file into a string without any of these contortions.

    data = new String(Files.readAllBytes(Paths.get(filename)));


That still comes down to an understanding of the API.

I don't think that appending a SingleLineString to a String object would obviously add a newline. The String/StringBuffer/StringBuilder classes in Java certainly don't work that way.

If you were adding to some other object like an instance of some text representation class, this could be different. But it would still not be 100% safe to deduce just by the API itself without looking at the documentation.


About as much as binary search being wrong in many languages for 20 years: https://thebittheories.com/the-curious-case-of-binary-search...

Nothing. Mistakes happen.


Pray tell...what has Java done poorly to ensure this mistake happened?




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: