Counting Lines of a Text File, the Smart Way

anon29821626 · 24 June 2020 14:45

Hello,

I found this educational C# blog post about reading the line count of a text file: Counting Lines of a Text File, the Smart Way. At the end a very efficient algorithm is described.

Would someone please help me translate it to Xojo?

I am already failing with the BinaryStream translations:

(bytesRead = stream.Read(byteBuffer, 0, byteBuffer.Length)) > 0

Xojo’s BinaryStream.Read methods either have no parameter or expect a string.
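
For reference, the C# expression above combines a read with the loop test: `Stream.Read(buffer, offset, count)` copies up to `count` bytes into `buffer` starting at `offset` and returns how many bytes it actually read, with 0 meaning end of stream. A translation therefore mainly needs a read call that takes a byte count, plus some end-of-file test. Below is a minimal sketch of the idiom on its own; the `ReadDemo` class, the `TotalBytes` helper and the 4-byte buffer are made up for illustration and are not part of the blog post.

```csharp
using System;
using System.IO;

static class ReadDemo
{
    // Hypothetical helper: adds up the sizes of all chunks returned by Read.
    public static long TotalBytes(Stream stream, int bufferSize)
    {
        var buffer = new byte[bufferSize];
        long total = 0;
        int bytesRead;
        // Read returns the number of bytes placed in buffer; 0 signals end of stream.
        while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            total += bytesRead;
        }
        return total;
    }

    static void Main()
    {
        using (var stream = new MemoryStream(new byte[10]))
        {
            // The 10 bytes arrive as chunks of 4, 4 and 2.
            Console.WriteLine(TotalBytes(stream, 4)); // prints 10
        }
    }
}
```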

public static long CountLinesMaybe(Stream stream)  
{
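    // Ensure.NotNull is a guard helper that is not part of this snippet;
    // a plain null check would do the same job.
    Ensure.NotNull(stream, nameof(stream));

    var lineCount = 0L;

    // CR, LF and NULL are not defined in the snippet as posted;
    // these definitions are assumed so that it compiles.
    const char CR = '\r';
    const char LF = '\n';
    const char NULL = (char)0;

    var byteBuffer = new byte[1024 * 1024]; // New MemoryBlock(1024 * 1024) ?
    const int BytesAtTheTime = 4;
    var detectedEOL = NULL; // Which data type? Variant?
    var currentChar = NULL;  // Which data type? Variant?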

    int bytesRead;
    while ((bytesRead = stream.Read(byteBuffer, 0, byteBuffer.Length)) > 0)
    {
        var i = 0;
        for (; i <= bytesRead - BytesAtTheTime; i += BytesAtTheTime)
        {
            currentChar = (char)byteBuffer[i];

            if (detectedEOL != NULL)
            {
                if (currentChar == detectedEOL) { lineCount++; }

                currentChar = (char)byteBuffer[i + 1];
                if (currentChar == detectedEOL) { lineCount++; }

                currentChar = (char)byteBuffer[i + 2];
                if (currentChar == detectedEOL) { lineCount++; }

                currentChar = (char)byteBuffer[i + 3];
                if (currentChar == detectedEOL) { lineCount++; }
            }
            else
            {
                if (currentChar == LF || currentChar == CR)
                {
                    detectedEOL = currentChar;
                    lineCount++;
                }
                i -= BytesAtTheTime - 1;
            }
        }

        for (; i < bytesRead; i++)
        {
            currentChar = (char)byteBuffer[i];

            if (detectedEOL != NULL)
            {
                if (currentChar == detectedEOL) { lineCount++; }
            }
            else
            {
                if (currentChar == LF || currentChar == CR)
                {
                    detectedEOL = currentChar;
                    lineCount++;
                }
            }
        }
    }

    if (currentChar != LF && currentChar != CR && currentChar != NULL)
    {
        lineCount++;
    }
    return lineCount;
}

npalardy · 24 June 2020 15:18

curious algorithm as really all it does is read the file in “chunks” then examine every byte for being a CR or LF; the first one it finds becomes the EOL marker and only that byte is counted from then on

that also covers the case where reading a chunk splits a CR/LF
so the CR is read and is the last char in the chunk and the LF is the first char in the next; since only one byte of the pair is ever counted, the split doesn’t matter

the second loop simply handles the bytes left over at the end of a chunk (fewer than 4) one at a time, while the first loop checks 4 bytes per pass
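
The quoted code only ever counts the first EOL byte it detects. If CR, LF and CRLF should each count as one line ending, the loop has to remember across chunks whether the previous byte was a CR. The sketch below shows one way to do that. It is an illustrative variant, not the blog's algorithm: it drops the 4-bytes-per-pass unrolling, and the names `LineCounter`, `CountLines`, `bufferSize` and `previousWasCR` are assumptions.

```csharp
using System;
using System.IO;
using System.Text;

static class LineCounter
{
    // Illustrative variant, not the blog's code: counts CR, LF and CRLF
    // each as one line ending, even when a CRLF pair straddles two chunks.
    public static long CountLines(Stream stream, int bufferSize = 1024 * 1024)
    {
        const byte LF = (byte)'\n';
        const byte CR = (byte)'\r';

        var buffer = new byte[bufferSize];
        long lineCount = 0;
        bool previousWasCR = false; // carried over from one chunk to the next
        bool sawAnyByte = false;
        byte lastByte = 0;
        int bytesRead;

        while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < bytesRead; i++)
            {
                byte b = buffer[i];
                if (b == CR)
                {
                    lineCount++;
                }
                else if (b == LF && !previousWasCR)
                {
                    lineCount++; // a lone LF; an LF right after CR was already counted
                }
                previousWasCR = (b == CR);
                lastByte = b;
            }
            sawAnyByte = true;
        }

        // Count a final line that has no terminator.
        if (sawAnyByte && lastByte != LF && lastByte != CR)
        {
            lineCount++;
        }
        return lineCount;
    }

    static void Main()
    {
        var data = Encoding.ASCII.GetBytes("a\r\nb\r\nc");
        using (var stream = new MemoryStream(data))
        {
            // A 2-byte buffer forces the first CRLF pair to be split across two reads.
            Console.WriteLine(CountLines(stream, 2)); // prints 3
        }
    }
}
```

The buffer size is a parameter here only so that the chunk-boundary case is easy to provoke with a tiny buffer; for real files a much larger buffer is the sensible default.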

MarkusWinter · 24 June 2020 18:35

If you want it efficient then don’t read a file in chunks - load it all and split it.

I need to process huge text files and reading it all in resulted in the biggest speed-up.
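
For comparison with the chunked C# code, the load-and-split approach might look roughly like this in the same language. This is a sketch under assumptions: `SplitCounter` and `CountLinesBySplit` are hypothetical names, and the whole text plus the array of split pieces has to fit in memory at once, so memory use grows with the size of the file.

```csharp
using System;

static class SplitCounter
{
    // Hypothetical helper: counts lines in text that is already fully in memory.
    public static int CountLinesBySplit(string text)
    {
        if (text.Length == 0) return 0;
        // Normalise CRLF and lone CR to LF so that a single split suffices.
        string normalised = text.Replace("\r\n", "\n").Replace('\r', '\n');
        int count = normalised.Split('\n').Length;
        // A trailing line ending yields one empty last element; don't count it.
        if (normalised.EndsWith("\n", StringComparison.Ordinal)) count--;
        return count;
    }

    static void Main()
    {
        // With a real file the text would come from System.IO.File.ReadAllText(path).
        Console.WriteLine(CountLinesBySplit("one\ntwo\r\nthree\n")); // prints 3
    }
}
```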

anon29821626 · 24 June 2020 18:45

My experience with an extremely large text file is that this leads to a freeze or crash of the program.

npalardy · 24 June 2020 19:32

making sure the chunks are “large enough” but not so large they require more RAM than you have is crucial
The chunks in this example are 1 MB (1024 * 1024 bytes) which seems a bit on the smallish side

Haven’t had a chance to look at the code in more depth but C# should port reasonably well

MarkusWinter · 24 June 2020 19:48

What do you consider “extremely large”?

npalardy · 24 June 2020 19:52

EDIT : oh wait I don’t think that question was to me
~~Files that are many GB in size at least~~
~~Even BBEdit has issues with those and it’s pretty darned fast most times~~
~~This code should be able to give you a line count for any file of any size~~

What would you consider “extremely large” ?

DaveS · 24 June 2020 19:58

if this is for macOS you could try

wc -l <filename>

anon29821626 · 24 June 2020 20:05

The file I have is 700 MB (plain text file).

MarkusWinter · 24 June 2020 20:28

That shouldn’t be too much. On my 2010 17-inch MacBook Pro with a 2.53 GHz i5, 8 GB of RAM and a 500 GB SSD I routinely processed large proteomics files between a few hundred MB and 1.5 GB.

The problem for me was not the loading of the textual data but its processing, as string processing is pretty slow in Xojo.

The following might be helpful:

https://www.boredomsoft.org/string-building-in-realbasic.bs

But if all you want to do is count lines then Dave’s suggestion should do fine.