The hoard of clutter within someone's memories

Some of it is useful

Any AI version of Marie Kondo should just give up!

Again, in the beginning of mankind, early men looked at natural, and possibly a few unnatural, events. These became stories which then turned into myths. However, the reaction of a few inquisitive minds was 'I wonder how often that happens!' And thus statistics was born. Some of these men derived answers and, rather than being overly helpful, decided to become priests of early gods, using events over which they had no power to solidify their place in society.

That description is a touch overdramatic, but observing events and determining their frequency within a given time frame, or counting things within a group, is as old as humanity. Determining 'how often?' or 'how frequent?' shaped migration, hunting and gathering behaviours, knowing when food was abundant and when it was not, and other bedrocks of the survival of early man. When civilisation started in the form of farming, observing patterns remained crucial to survival. Knowing 'how often?' or 'how much?' determined what percentage of any tribe was likely to survive another year. And a key feature of these early calculations was 'how much does each of us get?'

Before the dynamics of tribal politics come into play, the first basic calculation would be 'how much does each of us get if we were all equal?' From this number, all others regarding resources would be determined. If the distribution was too unequal, the tribe would implode. If it were too equal, those with special skills or knowledge might be tempted to strike out on their own. Juggling the right response depended upon the baseline result of 'what would the mean be for all?'

I have seen the word 'average' abused to such a degree that it loses all meaning. To start with, there is more than one average, and the three taught at high school are the mean, the mode and the median. Right now the focus is on the mean, as it is often the most informative in a mathematical sense.

So here is the exercise. We have a text file with a list of numbers and we want to find the mean. The text files are not of a set length; any amount of numbers could be within. Mistakes can be made, so it would help if any non-numeric data were ignored and did not crash the program. It may seem basic, but a lot of modern computing involves statistical analyses of large amounts of data, and a lot of that data sits in basic text files of one form or another.

Let's take the first step and learn how to find the mean of a column of numbers.


Using C

Bash is not meant for this type of stuff. Mathematics in Bash is limited to integers only, which is very restrictive. In order to take an average without fractions, the sum of the numbers in the list would have to divide exactly by their count, which can rarely be relied upon. Also, the data from the file arrives as text characters, which have to be converted into the numeric values themselves. In the end, all things are possible between Heaven and Earth and I am sure there is a way to get Bash to work with this, but right now the easiest option is to just use C.
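To see the limitation for yourself, here is a quick demonstration of Bash's built-in arithmetic silently truncating any division:

```shell
# Bash arithmetic is integer only: division silently truncates
echo $((7/2))    # prints 3, not 3.5
echo $((1/3))    # prints 0
```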

The mean involves adding a group of numbers and dividing by the amount of numbers in that group. Straightforward enough, but to do it repeatedly, even on a computer, is a bit tedious. Having thought about it, it is also unnecessary: if I have a group of numbers and its average, and I add another number to the group, I should be able to find the new average from the previous average and the value of the new number. So if I have a group of n-1 numbers with a known average, and I then add a new number of value V(n), the new average can be found by:

Average(n) = ((n-1)*Average(n-1) + V(n))/n

The above means there is no need to go through all the numbers repeatedly; just calculate the average and adjust it for each new number.

So, here's what I came up with. It was a bit more complicated than initially thought but it works.

#include <stdio.h>
#include <stdbool.h>

int main()
{
    FILE *fp;
    int ch;                       /* int, not char, so that EOF is detected reliably */
    int Countline = 0;
    char filename[40];
    bool notnumeric = false;
    char lastch = '\n';
    int decpntcount = 0;
    bool warnflag = false;
    char buffer[100];
    char *p = buffer;
    double mean = 0;
    double term;
    /* request file name, open and read */
    printf("Enter file name: ");
    scanf("%39s", filename);      /* width limit stops the name overflowing the array */
    fp = fopen(filename, "r");
    if (fp == NULL)
    {
        printf("File \"%s\" not present!\n", filename);
        return -1;
    }
    /* read character by character and check for new line */
    while ((ch = getc(fp)) != EOF)
    {
        if ((ch < '0' || ch > '9') && ch != '.' && ch != '-' && ch != '\n') {
            notnumeric = true;
        }
        if (ch == '.') {
            decpntcount++;
        }
        if (lastch != '\n' && ch == '-') {
            notnumeric = true;    /* a minus sign is only valid at the start of a line */
        }
        if (ch == '\n') {
            if (!notnumeric && decpntcount < 2 && lastch != '.' && lastch != '-' && lastch != '\n') {
                Countline++;
                *p = '\0';
                sscanf(buffer, "%lf", &term);
                mean = (((Countline - 1) * mean) + term) / Countline;
            }
            else {
                warnflag = true;
            }
            notnumeric = false;
            decpntcount = 0;
            p = buffer;           /* reuse the same buffer rather than allocating a new one */
        }
        else if (p < buffer + sizeof(buffer) - 1) {
            *p++ = ch;            /* guard against lines longer than the buffer */
        }
        lastch = ch;
    }
    /* close the file */
    fclose(fp);
    /* print number of lines */
    printf("Total number of numeric lines within the file is: %d\n\n", Countline);
    /* print warning if applicable */
    if (warnflag) {
        printf("There is non-numeric data within the file, please check carefully for anything other than comments and blank lines.\n\n");
    }
    else {
        printf("The file contains numeric data only.\n\n");
    }
    /* print mean */
    printf("The mean of the numerical lines is: %lf\n\n", mean);
    return 0;
}

This code opens the file and reads it line by line, checking each line to make sure it is numeric. Any non-numeric lines are ignored. The mean is adjusted with each numeric line; the program does not go back through all the previous lines to recalculate the mean.

Save the file as meancalc.c, or something else if you want as long as it ends with .c. There is no need to create an object file when compiling this small program; if using gcc, compile it with the following:

gcc meancalc.c -o meancalc

If you have a file with all the numerical data, such as filewithdata.txt, run the program by:

/filepath/to/meancalc

Or, if you are in the same working directory as meancalc:

./meancalc

and type the file name at the prompt, since the program asks for it rather than taking it as an argument.

In years BC

As stated earlier, Bash can only do integer arithmetic, but it can incorporate other command line entities. One such entity, and a very old one, is the basic calculator called bc. It does not have a large range of functions, but it can do decimal arithmetic, as long as one remembers to set the scale, otherwise the integer result is an annoying surprise. The good news is that although Bash can't do anything beyond integer arithmetic, it can call on bc, which can. So a Bash script that is much shorter than the C example can be created:

#!/usr/bin/env bash
filename="$1"
mean2=0
isanumber() { case ${1#[-+]} in ''|.|*[!0-9.]*|*.*.*) return 1;; esac ;}
amount=0
# read the file line by line; this avoids the word-splitting and
# globbing pitfalls of looping over $(cat "$filename")
while IFS= read -r LINE
do
    if isanumber "$LINE"; then
        amount=$((amount+1))
        mean=$mean2
        mean2=$(bc <<< "scale=15;(($mean*($amount-1))+$LINE)/$amount")
    fi
done < "$filename"
echo "The mean is $mean2"

The bc tool is very rough, so expect a few rounding-off errors, such as 1.99999 instead of 2 and the like. The pattern in the case statement within the function determines whether a line is a number or not. So this code roughly sorts the input as either a number or not-a-number and then includes any numbers in the calculation of the mean. The part scale=15 defines the number of decimal places; I chose a large number of decimal places to make any rounding-off errors more noticeable. Using bc may make the answer a bit rougher, but it works well enough, and it is highly likely that bc is already there if you are using even the most minimal form of Linux.

There are other ways to do this. I have always been a fan of Perl, which can use regular expressions and the same mathematical method. It can also be done with awk; in fact, if it is certain that the input file contains only numbers, then a mean can be calculated with just 3 lines in an awk script. Adding a regular expression condition should not make the script that much larger. However, these are not guaranteed to be on minimal Linux systems, and I am not a fan of loading something for the occasional one-off effort.

There are limits with both examples: neither accepts numbers in scientific notation. That would add another layer of complexity, although the C program could be simplified a little by using strtod from the standard library, which checks whether the input is a number and would also allow for scientific notation. As far as I am aware, bc has no scientific notation capability.

So there it is. One can either write a Bash script and call bc for basic cases, or use something like Perl or awk or even Python if that is what you are comfortable with. A C program may be a bit long-winded, but it is more flexible. Not the most surprising of conclusions, but all the methods show that more work is required than expected in the simple act of finding the mean of a group of numbers.

About

This is where I place the very basic notes on programming for those starting at the very beginning using Linux or similar operating systems. It's set up to be understood by everyone. If you have an opinion as to how this page is done, then you are already an intermediate or advanced programmer and I don't care!