Counting Lines of a Text File, the Smart Way

Hello,

I found this educational C# blog post about reading the line count of a text file: Counting Lines of a Text File, the Smart Way. At the end a very efficient algorithm is described.

Would someone please help me translate it to Xojo?

I am already failing with the BinaryStream translations:

(bytesRead = stream.Read(byteBuffer, 0, byteBuffer.Length)) > 0

Xojo’s BinaryStream.Read methods either have no parameter or expect a string.
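For anyone unfamiliar with the C# idiom above: `stream.Read(buffer, 0, buffer.Length)` fills a preallocated buffer and returns how many bytes were actually read, with 0 meaning end of stream. A minimal sketch of that loop shape, written in Python only as neutral pseudocode (it is not Xojo, and `iter_chunks` is a hypothetical helper name, not something from the blog post):

```python
import io

def iter_chunks(stream, chunk_size=1024 * 1024):
    """Yield successive chunks read from a binary stream into one reusable buffer."""
    buf = bytearray(chunk_size)      # allocated once, like byteBuffer in the C# code
    view = memoryview(buf)
    while True:
        bytes_read = stream.readinto(buf)   # number of bytes placed in buf; 0 at end of stream
        if not bytes_read:
            break
        # Only the first bytes_read bytes are valid; the rest is left over from earlier reads.
        yield bytes(view[:bytes_read])

# Usage: a 7-byte stream read 3 bytes at a time gives chunks of 3, 3 and 1 bytes.
chunks = list(iter_chunks(io.BytesIO(b"abcdefg"), chunk_size=3))
```

The point to carry over to any port is that the last chunk is usually shorter than the buffer, so the byte count returned by the read has to bound the loop, not the buffer size.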

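// Assumed from how the snippet uses them: CR, LF and NULL are char constants.
private const char CR = '\r';
private const char LF = '\n';
private const char NULL = (char)0;

public static long CountLinesMaybe(Stream stream)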
{
    Ensure.NotNull(stream, nameof(stream));

    var lineCount = 0L;

    var byteBuffer = new byte[1024 * 1024]; // New MemoryBlock(1024 * 1024) ?
    const int BytesAtTheTime = 4;
    var detectedEOL = NULL; // Which data type? Variant?
    var currentChar = NULL;  // Which data type? Variant?

    int bytesRead;
    while ((bytesRead = stream.Read(byteBuffer, 0, byteBuffer.Length)) > 0)
    {
        var i = 0;
        for (; i <= bytesRead - BytesAtTheTime; i += BytesAtTheTime)
        {
            currentChar = (char)byteBuffer[i];

            if (detectedEOL != NULL)
            {
                if (currentChar == detectedEOL) { lineCount++; }

                currentChar = (char)byteBuffer[i + 1];
                if (currentChar == detectedEOL) { lineCount++; }

                currentChar = (char)byteBuffer[i + 2];
                if (currentChar == detectedEOL) { lineCount++; }

                currentChar = (char)byteBuffer[i + 3];
                if (currentChar == detectedEOL) { lineCount++; }
            }
            else
            {
                if (currentChar == LF || currentChar == CR)
                {
                    detectedEOL = currentChar;
                    lineCount++;
                }
                i -= BytesAtTheTime - 1;
            }
        }

        for (; i < bytesRead; i++)
        {
            currentChar = (char)byteBuffer[i];

            if (detectedEOL != NULL)
            {
                if (currentChar == detectedEOL) { lineCount++; }
            }
            else
            {
                if (currentChar == LF || currentChar == CR)
                {
                    detectedEOL = currentChar;
                    lineCount++;
                }
            }
        }
    }

    if (currentChar != LF && currentChar != CR && currentChar != NULL)
    {
        lineCount++;
    }
    return lineCount;
}

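Curious algorithm, as really all it does is read the file in “chunks” and then examine every byte for being a CR or LF. The first one it finds is remembered as the line ending, and from then on only that byte is counted.

Because only one of the two bytes is ever counted, it also copes with the case where reading a chunk splits a CR/LF,
so the CR is read and is the last char in the chunk and the LF is the first char in the next.

The second loop simply picks up the leftover bytes at the end of a chunk one at a time, with the same checks as the first loop, which works through the buffer four bytes per iteration.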
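The counting logic described above can be sketched compactly. This is a rough Python rendering under stated assumptions, not the blog author's code: it drops the four-bytes-at-a-time unrolling, uses `bytes.count` for the inner loop, and `count_lines` is a hypothetical name.

```python
import io

CR, LF = b"\r", b"\n"

def count_lines(stream, chunk_size=1024 * 1024):
    """Count lines by counting the first end-of-line byte (CR or LF) seen in the stream."""
    line_count = 0
    detected_eol = None   # becomes CR or LF once the first one is seen
    last_byte = b""       # last byte of the most recent chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        if detected_eol is None:
            # Find whichever of CR / LF occurs first in this chunk, if any.
            positions = [p for p in (chunk.find(CR), chunk.find(LF)) if p != -1]
            if positions:
                first = min(positions)
                detected_eol = chunk[first:first + 1]
        if detected_eol is not None:
            # Only the detected byte is counted, so a CR/LF pair counts once
            # even when a chunk boundary falls between the CR and the LF.
            line_count += chunk.count(detected_eol)
        last_byte = chunk[-1:]
    # A final line without a trailing end-of-line byte still counts as a line.
    if last_byte and last_byte not in (CR, LF):
        line_count += 1
    return line_count

# Usage: a tiny chunk size forces a CR/LF pair to straddle two chunks.
split_pair_count = count_lines(io.BytesIO(b"a\r\nb\r\n"), chunk_size=2)
```

Like the C# version, this assumes the file uses one line-ending style throughout; a file that mixes bare CR and bare LF line endings would be undercounted.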

If you want it efficient then don’t read a file in chunks - load it all and split it.

I need to process huge text files and reading it all in resulted in the biggest speed-up.
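For comparison, the load-everything approach is only a few lines. A minimal sketch in Python (again just as neutral pseudocode; both function names are hypothetical), assuming the whole file fits comfortably in memory:

```python
from pathlib import Path

def count_lines_in_memory(data: bytes) -> int:
    """Count lines in data that is already fully loaded; handles LF, CR and CR/LF."""
    return len(data.splitlines())

def count_lines_from_path(path) -> int:
    """Read the entire file into memory at once, then split it into lines."""
    return count_lines_in_memory(Path(path).read_bytes())

# Usage on in-memory data: three lines, the last one without a trailing newline.
example_count = count_lines_in_memory(b"first\nsecond\nthird")
```

The trade-off is memory: the file contents and the list of split lines exist at the same time, so peak usage is well above the file size.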

My experience with an extremely large text file is that this leads to a freeze or crash of the program.

Making sure the chunks are “large enough”, but not so large they require more RAM than you have, is crucial.
The chunks in this example are 1 MB (1024 * 1024 bytes), which seems a bit on the smallish side.

Haven’t had a chance to look at the code in more depth, but C# should port reasonably well.

What do you consider “extremely large”?

EDIT: oh wait, I don’t think that question was to me :stuck_out_tongue:
Files that are many GB in size at least
Even BBEdit has issues with those and it’s pretty darned fast most times
This code should be able to give you a line count for any file of any size

What would you consider “extremely large”?

if this is for macOS you could try

wc -l <filename>
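One thing to keep in mind when comparing results: `wc -l` counts newline (LF) characters, so a file whose last line has no trailing newline is reported as one line fewer than the C# algorithm above would give. A small Python sketch of that counting rule (`count_newlines` is a hypothetical name, and this only mimics the rule, not `wc` itself):

```python
import io

def count_newlines(stream, chunk_size=1024 * 1024):
    """Count LF bytes in a binary stream, chunk by chunk (the same rule wc -l applies)."""
    total = 0
    # iter() with a sentinel keeps calling read() until it returns b"" at end of stream.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        total += chunk.count(b"\n")
    return total

# Usage: the final line "c" has no trailing newline, so only two LFs are counted.
newline_count = count_newlines(io.BytesIO(b"a\nb\nc"))
```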

The file I have is 700 MB (a plain text file).

That shouldn’t be too much. On my 2010 17in MacBook Pro with a 2.53 GHz i5, 8 GB of RAM and a 500 GB SSD I routinely processed large proteomics files between a few hundred MB and 1.5 GB.

The problem for me was not the loading of the textual data but its processing, as string processing is pretty slow in Xojo.

The following might be helpful:

https://www.boredomsoft.org/string-building-in-realbasic.bs

But if all you want to do is count lines then Dave’s suggestion should do fine.