
Show HN: Alphareader: a custom separator and endline file reader in Python - canimus
https://github.com/canimus/alphareader
======
eesmith
Some tweaks:

#1:

    
    
            elif not isinstance(fn_transform, FunctionType) or not isinstance(fn_transform, LambdaType):
                raise TypeError('Transformation parameter should be a function or lambda i.e. fn = lambda x: x.replace(a,b)')
    

What about using callable()? There's no reason you couldn't use a functor, for
example.
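A minimal sketch of the callable() check, with a functor to show why it is more general (names like fn_transform follow the snippet above; Upper and check are illustrative):

```python
class Upper:
    """A functor: an instance of a class with __call__ is invocable like a function."""
    def __call__(self, x):
        return x.upper()

def check(fn_transform):
    # callable() accepts plain functions, lambdas, functors, and bound methods,
    # so there is no need to test FunctionType/LambdaType separately.
    if not callable(fn_transform):
        raise TypeError('Transformation parameter should be callable, '
                        'e.g. fn = lambda x: x.replace(a, b)')
    return fn_transform

check(Upper())          # functor passes
check(lambda x: x + 1)  # lambda passes
```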

#2:

    
    
            curr = file_handle.read(chunk_size)
            if encoding:
                curr = curr.decode(encoding)
    

That assumes a single-byte encoding. Consider a multi-byte encoding where the
chunk_size reads only part of the character:

    
    
        >>> s="ü"
        >>> s.encode("utf8")[:1].decode("utf8")
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 0: unexpected end of data
    

#3:

    
    
            if chr(terminator) in chunk:
                lines = chunk.split(chr(terminator))
    

Might want to compute chr(terminator) once, rather than re-evaluate it each
time.
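Something along these lines, hoisting the conversion out of the loop (read_lines and its parameters are illustrative, not the library's actual API):

```python
import io

def read_lines(file_handle, terminator, chunk_size=8):
    sep = chr(terminator)  # computed once, not re-evaluated per chunk
    buffer = ''
    while True:
        chunk = file_handle.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        if sep in buffer:
            # keep the trailing partial record in the buffer
            *lines, buffer = buffer.split(sep)
            yield from lines
    if buffer:
        yield buffer

list(read_lines(io.StringIO('a|b|c'), ord('|')))  # ['a', 'b', 'c']
```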

#4:

    
    
            try:
                transformations = iter(fn_transform)
                yield list(map(lambda x: reduce(lambda a,b: b(a), fn_transform, x), columns))
            except TypeError:
                yield list(map(fn_transform, columns))
    

Since you've already checked for the two cases, set a flag to remember what
fn_transform contains. Then branch on that, rather than use the try/except.

Otherwise, consider what happens if one of the callables raises a TypeError
because of an internal error, rather than because of an expected structural
mismatch.
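A sketch of the flag approach, deciding once up front rather than catching TypeError per row (apply_transforms is a hypothetical name, not the library's):

```python
from functools import reduce

def apply_transforms(columns, fn_transform):
    # Decide once whether fn_transform is a single callable or an iterable
    # of callables; branch on that flag instead of try/except, so a TypeError
    # raised *inside* one of the callables is not silently misinterpreted.
    is_chain = not callable(fn_transform)
    if is_chain:
        fns = list(fn_transform)  # materialize once
        return [reduce(lambda acc, fn: fn(acc), fns, col) for col in columns]
    return [fn_transform(col) for col in columns]

apply_transforms(['a', 'b'], [str.upper, lambda s: s + '!'])  # ['A!', 'B!']
apply_transforms(['a', 'b'], str.upper)                       # ['A', 'B']
```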

~~~
canimus
Thank you @eesmith. Comments appreciated, and PRs to the repo as well. ;-) The
multi-byte issue is a great catch! I made the wrong assumption of single-byte
separators. Perhaps that's a library limitation if we want to keep the logic
simple. Ideas for a fix?

~~~
eesmith
If it's a fixed-width encoding, nudge the read size to a multiple of that
encoding size.

If it's utf-8, keep the block reads in byte space, search for the terminator
as a byte sequence, and only decode _after_ you find the terminator.
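A minimal sketch of that byte-space approach, assuming a single-byte terminator encoded in ASCII (read_records and its parameters are illustrative):

```python
import io

def read_records(file_handle, terminator=b'|', chunk_size=4, encoding='utf-8'):
    # Read and split in byte space; decode only complete records, so a
    # multi-byte character straddling a chunk boundary never hits .decode().
    buffer = b''
    while True:
        chunk = file_handle.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        *records, buffer = buffer.split(terminator)
        for record in records:
            yield record.decode(encoding)
    if buffer:
        yield buffer.decode(encoding)

# 'ü' is two bytes in UTF-8 and gets split across chunks here, yet decoding
# still succeeds because it happens only after a full record is assembled.
fh = io.BytesIO('fü|ba|qü'.encode('utf-8'))
list(read_records(fh))  # ['fü', 'ba', 'qü']
```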

Otherwise, throw your hands up in the air and give up?

Catch the UnicodeDecodeError, use err.start, and see if it's close to the end
of the block? If it is, then do another read?

BTW, you can mitigate some Python overhead by using a larger read size.

