Counting and modifying lines, words and characters in Linux text files
Linux offers some handy commands for counting things in text files. This post examines some of the options for counting lines and words, and for making changes that can help you see just what you want.
Counting lines
Counting lines in a file is very easy with the wc command. Use a command like the one shown below, and you'll get a quick response.
$ wc -l myfile
132 myfile
What the wc command is actually counting is the number of newline characters in a file. So, if you had a one-line file with no newline character at the end, it would tell you the file has 0 lines.
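You can verify this behavior with printf, which (unlike echo) adds a newline only when you ask for one:

```shell
# wc -l counts newline characters, so a final line that lacks a
# trailing newline is not counted
printf 'one line, no newline' | wc -l    # prints 0
printf 'one line\n' | wc -l              # prints 1
```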
The wc -l command can also count the lines in any text that is piped to it. In the example below, wc -l is counting the number of files and directories in the current directory (plus one, since ls -l starts its output with a "total" line).
$ ls -l | wc -l
1184
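If you want the entry count without the extra "total" line, plain ls (which lists one name per line when its output is piped) avoids the problem. A small sketch, using a throwaway directory:

```shell
# ls prints one entry per line when piped, and no "total" line
dir=$(mktemp -d)                 # throwaway directory for the demo
touch "$dir/a" "$dir/b" "$dir/c"
ls "$dir" | wc -l                # prints 3 (one line per entry)
rm -r "$dir"
```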
If you pipe text to a wc command with a hyphen as its argument, wc will count the lines, words and characters.
$ echo hello to you | wc -
 1  3 13 -
The response shows the number of lines (1), words (3) and characters (13, counting the newline).
If you want to get the same information for a file, pipe the file to the wc command as shown below.
$ cat notes | wc -
48 613 3705 -
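The cat pipeline works, but wc can also read a file directly, or through input redirection if you'd rather suppress the file name in the output. A minor variation on the commands above, shown with a temporary file:

```shell
f=$(mktemp)
printf 'hello to you\n' > "$f"
wc "$f"       # lines, words and characters, followed by the file name
wc < "$f"     # same counts (1 3 13), with no file name shown
rm "$f"
```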
Counting words
For just a word count, use the -w option as shown in the examples below.
$ wc -w notes
613 notes
$ date | wc -w
7
Counting characters
To count the characters in a file, use the -c option. Keep in mind that this will count newline characters as well as letters and punctuation marks.
$ wc -c notes
3705 notes
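Strictly speaking, -c counts bytes. For plain ASCII text the two numbers are the same, but if your file contains multibyte (e.g., UTF-8) characters, wc -m reports the character count instead:

```shell
# in a UTF-8 locale, é occupies two bytes but is a single character
printf 'café\n' | wc -c    # 6 bytes
printf 'café\n' | wc -m    # 5 characters (in a UTF-8 locale)
```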
Counting instances of particular words
Counting how many times a particular word appears in a file is quite a bit more complicated. Counting how many lines contain a word is much easier.
$ cat notes | grep the | wc -l
32
$ cat notes | grep [Tt]he | wc -l
40
The second command above counts lines containing "the" whether or not the word is capitalized. It still won't tell you how many times "the" appears overall, since any line containing the word more than once is counted only once.
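One way to count every occurrence rather than every matching line (a variation beyond the commands above) is grep's -o option, which prints each match on its own line; adding -w keeps longer words like "there" and "other" from being counted as well:

```shell
# count each whole-word, case-insensitive occurrence of "the"
printf 'The cat sat on the mat.\nThen the dog left.\n' |
    grep -o -i -w 'the' | wc -l    # prints 3 ("Then" is excluded by -w)
```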
Ignoring punctuation and capitalization
Some words (e.g., "The" and "the") will appear in your word lists more than once. You're also going to see the same word with and without trailing punctuation (e.g., "end" and "end.") since the commands described above don't separate words from punctuation. To move past these problems, some additional commands are included in the examples that follow.
Removing punctuation
In the command below, a file containing a long string of punctuation characters is passed to a tr -d command that removes all of them from the output. Notice how everything other than the "Characters" string is removed.
$ cat punct-chars
Characters .?,"!:'[]():
$ cat punct-chars | tr -d '[:punct:]'
Characters
Changing text to all lowercase
A tr command can turn all characters to lowercase to ensure that words that begin with a capital letter (often because they start a sentence) or appear in all capitals are not listed separately from those appearing in all lowercase.
$ echo "Good day to YOU" | tr '[A-Z]' '[a-z]'
good day to you
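An equivalent and somewhat more portable spelling uses tr's character classes, which also behave sensibly in non-ASCII locales:

```shell
# [:upper:] and [:lower:] cover the locale's full alphabet,
# not just A-Z
echo "Good day to YOU" | tr '[:upper:]' '[:lower:]'    # prints: good day to you
```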
Working with a script
The script below sets up a series of commands that extract the words from a text file using increasingly thorough methods, so that you can see the output at each stage.
Note: The script passes each collection of output to the column command to make it a little easier to view.
#!/bin/bash

echo -n "file: "
read -r file

# separate the file into word-per-line format
tr -s '[:blank:]' '\n' < "$file" > "$file-2"

# list the words in columnar format
sort "$file-2" | uniq -c | column
echo -n "try next command?> "
read -r ans

# removing punctuation
sort "$file-2" | tr -d '[:punct:]' | uniq -c | column
echo -n "try next command?> "
read -r ans

# changing text to all lowercase
sort "$file-2" | tr -d '[:punct:]' | tr '[A-Z]' '[a-z]' | uniq -c | column
The output below shows what you would see if you ran the script against a file containing the following Einstein quote:
"Two things are infinite: the universe and human stupidity and I'm not sure about the universe." ― Albert Einstein
$ word-by-word
file: Einstein
      1 ―               1 human           2 the
      1 about           1 I'm             1 things
      1 Albert          1 infinite:       1 "Two
      2 and             1 not             1 universe
      1 are             1 stupidity       1 universe."
      1 Einstein        1 sure
try next command?> y
      1 ―               1 human           2 the
      1 about           1 Im              1 things
      1 Albert          1 infinite        1 Two
      2 and             1 not             2 universe
      1 are             1 stupidity
      1 Einstein        1 sure
try next command?> y
      1 ―               1 human           2 the
      1 about           1 im              1 things
      1 albert          1 infinite        1 two
      2 and             1 not             2 universe
      1 are             1 stupidity
      1 einstein        1 sure
One downside of removing punctuation is that it removes the apostrophes from contractions like "I'm". The script also decapitalizes proper names.
Note that the horizontal bar (―) is not removed from the Einstein quote by the punctuation-removal command. Left- and right-leaning ("curly") double quotes, if your text contains them, won't be removed either. That's because these characters are not included in the definition of '[:punct:]'.
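If you don't need to see the intermediate stages, the script's three passes can be collapsed into one pipeline that produces a word-frequency list in a single shot (a condensed variant, not from the script above; "notes" is a placeholder file name):

```shell
# split into one word per line, strip punctuation, lowercase,
# then count, listing the most frequent words first
tr -s '[:blank:]' '\n' < notes |
    tr -d '[:punct:]' |
    tr '[:upper:]' '[:lower:]' |
    sort | uniq -c | sort -rn | head
```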
Wrap-up
Linux provides a number of ways to count lines, words and characters in text, and to make changes that help with counting the words. Some are just a bit more complex than others.
Copyright © 2023 IDG Communications, Inc.