Counting and modifying lines, words and characters in Linux text files

March 27, 2023


Linux includes some useful commands for counting when it comes to text files. This post examines some of the options for counting lines and words and for making changes that might help you see what you want.

Counting lines

Counting lines in a file is very easy with the wc command. Use a command like that shown below, and you'll get a quick response.

$ wc -l myfile
132 myfile

What the wc command is actually counting is the number of newline characters in a file. So, if you had a single-line file with no newline character at the end, it would tell you the file has 0 lines.

The wc -l command can also count the lines in any text that is piped to it. In the example below, wc -l is counting the lines of output that ls -l generates for the files and directories in the current directory.
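A quick way to confirm this behavior is to feed wc the same text with and without a trailing newline. The sketch below uses printf, which adds a newline only when asked; the variable names are made up for illustration.

```shell
#!/bin/sh
# printf adds no newline unless the format string contains \n
without_newline=$(printf 'one line' | wc -l)
with_newline=$(printf 'one line\n' | wc -l)

# wc -l reports newline characters, not lines as a reader would count them
echo "without trailing newline: $without_newline"
echo "with trailing newline:    $with_newline"
```

The first count should come back as 0 and the second as 1.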

$ ls -l | wc -l
1184
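Keep in mind that ls -l starts its output with a "total" line, so the count is one higher than the number of entries listed. The sketch below illustrates the difference using a scratch directory holding three files; the directory and file names are arbitrary.

```shell
#!/bin/sh
# Build a scratch directory holding exactly three files
demo_dir=$(mktemp -d)
touch "$demo_dir/a" "$demo_dir/b" "$demo_dir/c"

long_count=$(ls -l "$demo_dir" | wc -l)   # includes the "total" line
plain_count=$(ls "$demo_dir" | wc -l)     # one line per entry when piped

echo "ls -l | wc -l: $long_count"
echo "ls | wc -l:    $plain_count"

rm -r "$demo_dir"
```

If you only want the number of entries, piping plain ls (or ls -A to include hidden files) to wc -l is the simpler choice.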

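If you pipe text to a wc command with a hyphen as its argument, wc will count the lines, words and characters. The hyphen tells wc to read from standard input and, with no other options, wc reports all three counts.

$ echo hello to you | wc -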
      1       3      13 -

The responses show the number of lines (1), words (3) and characters (13 counting the newline).

If you want to get the same information for a file, pipe the file to the wc command as shown below.
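If a script needs those three numbers separately, one option is to let the shell split the output of wc into fields. This is only a sketch; the variable names lines, words and chars are illustrative, and set -- replaces the script's positional parameters.

```shell
#!/bin/sh
# With no options, wc prints lines, words and bytes in that order.
# Leaving the command substitution unquoted lets the shell split on whitespace.
set -- $(echo hello to you | wc -)
lines=$1
words=$2
chars=$3

echo "lines=$lines words=$words chars=$chars"
```

This prints lines=1 words=3 chars=13, matching the counts described above.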

$ cat notes | wc -
     48     613    3705 -

Counting words

For just a word count, use the -w option as shown in the examples below.

$ wc -w notes
613 notes
$ date | wc -w
7

Counting characters

To count the characters in a file, use the -c option. Keep in mind that this will count newline characters as well as letters and punctuation marks. Strictly speaking, -c counts bytes, which is the same thing for plain ASCII text.

$ wc -c notes
3705 notes
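For text that contains multibyte characters, the byte count and the character count can differ. The sketch below compares wc -c with wc -m on a short string containing one two-byte UTF-8 character. The result of wc -m depends on the locale in effect, so treat the second number as an assumption about your environment.

```shell
#!/bin/sh
# \303\251 is the two-byte UTF-8 encoding of an accented "e"
byte_count=$(printf 'caf\303\251\n' | wc -c)   # c, a, f, two bytes, newline
char_count=$(printf 'caf\303\251\n' | wc -m)   # locale-dependent

echo "bytes:      $byte_count"
echo "characters: $char_count"
```

The byte count is 6. In a UTF-8 locale the character count would typically be 5, while in the C locale it matches the byte count.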

Counting instances of particular words

Counting how many times a particular word appears in a file is a lot more complicated. Counting how many lines contain a word is considerably easier.

$ cat notes | grep the | wc -l
32
$ cat notes | grep '[Tt]he' | wc -l
40

The second command above counts lines containing “the” whether or not the word is capitalized. It still doesn't tell you how many times “the” appears overall, since any line that contains the word more than once gets counted only once.
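One way to get closer to a true count of occurrences is to have grep print each match on its own line and then count those lines. The sketch below assumes a grep that supports the -o option (GNU and BSD grep both do) and uses a small sample string in place of a real file.

```shell
#!/bin/sh
# Sample text: "the" appears 4 times as a whole word, spread over 2 lines
sample='The cat saw the dog and the bird.
Then the other cat left.'

# -c counts matching lines; -o prints each match on its own line;
# -i ignores case; -w matches whole words only, so "Then" and "other" are skipped
line_count=$(printf '%s\n' "$sample" | grep -ciw 'the')
total_count=$(printf '%s\n' "$sample" | grep -oiw 'the' | wc -l)

echo "lines containing the word: $line_count"
echo "total occurrences:         $total_count"
```

Here the line count is 2 and the total is 4. Without -w, a pattern like "the" would also match inside longer words.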

Ignoring punctuation and capitalization

Some words (e.g., “The” and “the”) will appear in your word lists more than once. You're also going to see strings like “end” and “end.” since the commands described above don't separate words from punctuation. To move past these problems, some additional commands are included in the examples that follow.

Removing punctuation

In the command below, a file containing a long string of punctuation characters is passed to a tr -d command that removes all of them from the output. Notice how everything other than the “Characters ” string is removed from the output.

$ cat punct-chars
Characters .?,"!:'[]():
$ cat punct-chars | tr -d '[:punct:]'
Characters

Changing text to all lowercase

A tr command can turn all characters to lowercase to ensure that words that start with a capital letter (often because they start a sentence) or contain all capitals are not listed separately from those appearing in all lowercase.

$ echo "Good day to YOU" | tr '[A-Z]' '[a-z]'
good day to you
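The two tr steps can be chained so that punctuation is removed and case is folded in a single pipeline. The sketch below uses the POSIX classes [:upper:] and [:lower:] in place of the A-Z and a-z ranges; the sample sentence is made up.

```shell
#!/bin/sh
# Strip punctuation first, then fold uppercase to lowercase
normalized=$(echo "Hello, World! HELLO." | tr -d '[:punct:]' | tr '[:upper:]' '[:lower:]')

echo "$normalized"
```

This prints "hello world hello", so all three words would be counted consistently by a later sort | uniq -c.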

Using a script

The script below sets up three sets of commands for extracting the contents of a text file and extracting the words using increasingly thorough strategies, so that you can see the output at each stage.

Note: The script passes the final collections of output to the column command to make the output a little easier to view.

#!/bin/bash

echo -n "file: "
read file

# separate file into word-per-line format
tr -s '[:blank:]' '[\n*]' < "$file" > "$file-2"

# list words in columnar format
sort "$file-2" | uniq -c | column

echo -n "try next command?> "
read ans

# removing punctuation
sort "$file-2" | tr -d '[:punct:]' | uniq -c | column

echo -n "try next command?> "
read ans

# changing text to all lowercase
sort "$file-2" | tr -d '[:punct:]' | tr '[A-Z]' '[a-z]' | uniq -c | column

The output below shows what you would see if you ran the script against the following Einstein quote:

"Two things are infinite: the universe and human stupidity and I'm not sure about the universe."
― Albert Einstein

$ word-by-word
file: Einstein
      1 ―                     1 human                 2 the
      1 about                 1 I'm                   1 things
      1 Albert                1 infinite:             1 "Two
      2 and                   1 not                   1 universe
      1 are                   1 stupidity             1 universe."
      1 Einstein              1 sure
try next command?> y
      1 ―                     1 human                 2 the
      1 about                 1 Im                    1 things
      1 Albert                1 infinite              1 Two
      2 and                   1 not                   2 universe
      1 are                   1 stupidity
      1 Einstein              1 sure
try next command?> y
      1 ―                     1 human                 2 the
      1 about                 1 im                    1 things
      1 albert                1 infinite              1 two
      2 and                   1 not                   2 universe
      1 are                   1 stupidity
      1 einstein              1 sure

Removing punctuation has a downside, as it also removes the apostrophes from contractions like “it's”. The script also decapitalizes proper names.
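If keeping contractions intact matters, one possible workaround is to delete everything except letters, digits, whitespace and the straight apostrophe, rather than deleting the whole punctuation class. The sketch below does this with sed; the function name strip_keep_apostrophes is hypothetical, and curly apostrophes are not preserved by this approach.

```shell
#!/bin/sh
# Delete every character that is not a letter, digit, apostrophe or whitespace
strip_keep_apostrophes() {
    sed "s/[^[:alnum:]'[:space:]]//g"
}

cleaned=$(echo "\"It's fine,\" she said; I'm not sure." | strip_keep_apostrophes)
echo "$cleaned"
```

The quotes, comma, semicolon and period are dropped, while "It's" and "I'm" survive unchanged.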

Note that the dash in front of Einstein's name is not removed from the quote by the punctuation removal command. In addition, if your text includes left- and right-leaning double quotes, they also won't be removed. This is because these non-ASCII characters are not included in the definition of ‘[:punct:]’ that tr uses.
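The stages shown in the script can also be collapsed into one non-interactive pipeline. The sketch below is a variation rather than the script above: it normalizes the words before sorting, so that uniq -c always sees identical words on adjacent lines, and it lists the most frequent words first. The function name word_freq and the sample text are made up for illustration.

```shell
#!/bin/sh
# word_freq: read text on stdin, print "count word" lines, most frequent first
word_freq() {
    tr -s '[:space:]' '\n' |          # one word per line
        tr -d '[:punct:]' |           # drop ASCII punctuation
        tr '[:upper:]' '[:lower:]' |  # fold case
        grep -v '^$' |                # discard lines left empty by the steps above
        sort | uniq -c | sort -rn
}

printf 'The cat. the CAT\nthe dog\n' | word_freq
```

For this sample, "the" is counted 3 times, "cat" twice and "dog" once. As with the script, non-ASCII punctuation such as curly quotes would pass through untouched.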

Wrap-up

Linux provides a number of ways to count lines, words and characters in text and to make modifications that help count the words. Some are just a little more complicated than others.
