Hacker Newsnew | past | comments | ask | show | jobs | submit | jepler's commentslogin

My mind just balks at the idea of having so much source that a 2020s computer could take hours to index it. ctags is nothing special (both in terms of optimization but also the level of detail it gets to: just global function identifiers) and looks like it runs at about 400MB/s on a single core of an i5-1235U. But still it looks ctags could process about 100TB in 4 hours across 16 threads on a workstation class CPU...


It sounds like the indexing time/complexity is increased a lot by the amount of detailed data they're storing. They mention determining which `using` statement is used to resolve each symbol reference in C++ source, to enable dead code detection; that's going to require some sophisticated analysis.


Correct, you need to build an AST representation of the code that you want to index. Essentially, it's a compiler frontend pass and which is why it takes so much longer than what ctags heuristics do. Now think millions of lines of code, multiple build configurations, the amount of RAM you need, etc. Multiple branches, or even smaller revisions/commits, is also a big computation problem.

That said, Glean seems to be reusing the indexer from LLVM/clang for C and C++.

> The C++ indexer ("the clang indexer") is a wrapper over clang. The clang indexer is a drop in replacement for the C++ compiler that emits Glean facts instead of code. The wrapper is linked against libclang and libllvm.

[1] https://glean.software/docs/indexer/cxx


The whole point of indexing data is to perform very expensive computation once and leverage the result many many times and it works really well.


It's a mono repo across a dozen languages (good luck with ctags) that tens of thousands of developers commit to every day. Even if you'd spend the hours indexing it locally, it would be out of date right away.


You kinda said it yourself already - ctags is fast because it's producing almost nothing of value. Being fast at doing nothing isn't impressive.

Try doing the same with C++ and more indexing options enabled, such as with something like universal-ctags, and a larger code base, say Android's repository aught to do it. Are you still getting 400MB/s? Nope.


According to http://comopt.ifi.uni-heidelberg.de/software/TSPLIB95/STSP.h... the optimal solution for eil51 is 426 (the linked page has a tour of 428.87) and the optimal solution for st70 is 675 (the linked page has a tour of 677.11).


What's with the 12-34-21-33 loop shown on Concorde's st70 path as shown on OP's article? You'd expect it to avoid intersecting routes in a final answer to a given problem.


I wonder whether you still have a bug. It depends how "scanning" is implemented.

    23:55 start scanning
    23:59:59 message added
    00:00:01 perform timestamp check
If there is the possibility of a message being added very late, and either being omitted from the scan that started earlier, or a delay from finishing the scan to performing the timestamp check, you might still miss a message.

You also mentioned a directory being created 5 minutes early: what if some system has a clock skewed the opposite way, and puts a message in a directory 5 minutes late instead?


assuming you mean the decimal expansion and a length-3 de bruijn sequence on 10 symbols, I am not seeing it. There seem to be many sub-sequences which do not appear. (assuming that python's Decimal library at a precision of 4000 digits is actually giving the first 1000 digits correctly; though I have no reason to doubt this I don't know for sure the maximum ULP error of exponent)

    Python 3.11.2 (main, Sep 14 2024, 03:00:30) [GCC 12.2.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from decimal import Decimal, getcontext
    >>> getcontext().prec = 4000
    >>> d = 17 ** (1 / Decimal(7))
    >>> sd = str(d)
    >>> sd[:12]
    '1.4989198720'
    >>> s = sd[2:][:1000]
    >>> s[:10], s[-10:]
    ('4989198720', '6163659068')
    >>> [i for i in range(1000) if "%03d" % i not in s+s]
    [0, 2, 4, 5, 9, [...], 993, 994, 995, 998, 999]
    >>> len(_)
    361


id prefer it if you would write this in haskell so i can understand what is going on there better. i am saying of the first 1,000 digits after the ones place, each digit of the sequence appears an even number of times, every consecutive subsequence of two appears an even number of times, and every consecutive subsequence of three happens exactly once. please, if you could do some more intelligable work and show your result as standard deviations away from the the exact de bruijn counts i would really appreciate it. etcetcera.

(i dont know where you got the 10 symbols from. im talking about over the entire 1,000 length decimal sequence. etcetera.)


I don't know Haskell, so I can imagine if you didn't know Python it'd be "greek to you" (unintelligible). I'll try explaining in prose what I did. [note: I also edited the grandparent comment because when I simplified the session I showed, the assignment of "s" ended up out of order]

I use the Python decimal package to calculate 17^(1/7) to 4000 decimal places. I show the first 12 characters of the ASCII representation ("1.4989198720"), then take the first 1000 characters after the decimal place. I show the first and last 10 characters of this string ("4989198720...6163659068") so that you can verify it against the string you are testing.

Then I use a list comprehension to check each 3-digit string (e.g., 001, 002, 003, ..., 999) to find out whether it is a substring of the concatenation of `s` with itself, and list the ones that don't appear. I manually abbreviated the list, but I'm saying that in my string, 000, 002, 004, 005, 009, [and some other numbers], 993, 994, 995, 998, and 999 did not appear. (The concatenation `s+s` is checked so that the subsequences that occur at the wraparound---here,684 and 849---are found)

Because some of these length-3 sequences are missing, to my understanding this does not constitute a de Bruijn sequence of order 3 on a size-10 alphabet. However, I'm also not entirely sure whether this is what you were claiming.


yes. your code is what i was intending to check, after all. i got confused that the '10' might represent some kind of sub-sub string. what happened was simple, i dont have access to a good haskell environment and 'claude 3.5 sonnet' hallucinated perfectly even counts, probably because i was leading a little. i apologize for not checking claudes clearly dubious output. i got excited that it seemed the 1,000 characters i copy pasted seemed to be 'super magic'. i still see some kind of pattern in the '7' digit sequence, though, and maybe there still might be a metric that estimates this property as a function of how far into the sequence you go that might have some applicability, maybe. sorry.

(also, i didnt sleep last night and smoke weed. im on a phone.)


No downtime noted yet at https://status.python.org/ but confirmed from here


and it's back now. No incidents reported on status.python.org.

there was some discussion on at least two github issues: https://github.com/python/cpython/issues/127307 and https://github.com/python/pythondotorg/issues/2663


Too tired to find the actual debunking of these claims but no, d-wave is not effective for cryptanalysis of RSA or AES as deployed.

The papers claim some interesting but not game-changing results on factorization of tiny primes and attacking toy, reduced round cryptosystems that resemble AES in a particular way.


Anil is a given first name, maybe just not in your culture.


You don't need machine code these days to find overflow in IEEE floating point arithmetic.

If any intermediate sum overflows, you get an infinity. Further arithmetic can never return you to the realm of finite numbers (though you can also get into the world of "not a number" NaNs by calculating -inf + inf for instance), so you can use a standard function like (C99 <math.h>) isfinite() to check if any intermediate step of summing FP numbers overflowed the representation.


If you're writing for an audience outside of India you might want to explain what ₹9 lakh equates to, as both the value of ₹1 and the magnitude of the lakh are probably unfamiliar.

(Apparently that's 9x10^5 INR, or in the neighborhood of 10,000€ or $10,000. If you're curious about this system of naming and writing numbers used in India and other countries in the region, https://en.wikipedia.org/wiki/Lakh and https://en.wikipedia.org/wiki/Indian_numbering_system#Use_of... will get you started)


https://www.flightconnections.com/flights-from-iah-to-hav there are direct flights from USA to Cuba these days


Consider applying for YC's Fall 2025 batch! Applications are open till Aug 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: