I can run command-line tools in the notebook and take notes along the way. Let's go to the working directory first.
cd playground
ls -alt
total 256
drwxr-xr-x+ 82 Tammy staff 2788 May 4 22:06 ..
drwxr-xr-x 7 Tammy staff 238 May 4 21:42 .
-rw-r--r--@ 1 Tammy staff 6148 May 1 22:21 .DS_Store
-rw-r--r-- 1 Tammy staff 4608 May 1 22:00 iris.csv
drwxr-xr-x 3 Tammy staff 102 May 1 09:40 play
-rw-r-----@ 1 Tammy staff 114348 Mar 29 22:25 pybamview_example_data.tar.gz
drwxr-xr-x@ 7 Tammy staff 238 Jul 11 2014 examples
We are going to work with the famous iris dataset from R, saved here as iris.csv. First, look at the first several lines of the data.
head -5 iris.csv
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
To get a better view of the data, use the csvlook command from csvkit. csvkit uses a comma as the default delimiter; if you have a tab-delimited file, use the -t flag. csvkit has many other useful commands; see its documentation for the rest.
cat iris.csv | head | csvlook
|---------------+-------------+--------------+-------------+--------------|
|  sepal_length | sepal_width | petal_length | petal_width | species      |
|---------------+-------------+--------------+-------------+--------------|
|  5.1          | 3.5         | 1.4          | 0.2         | Iris-setosa  |
|  4.9          | 3.0         | 1.4          | 0.2         | Iris-setosa  |
|  4.7          | 3.2         | 1.3          | 0.2         | Iris-setosa  |
|  4.6          | 3.1         | 1.5          | 0.2         | Iris-setosa  |
|  5.0          | 3.6         | 1.4          | 0.2         | Iris-setosa  |
|  5.4          | 3.9         | 1.7          | 0.4         | Iris-setosa  |
|  4.6          | 3.4         | 1.4          | 0.3         | Iris-setosa  |
|  5.0          | 3.4         | 1.5          | 0.2         | Iris-setosa  |
|  4.4          | 2.9         | 1.4          | 0.2         | Iris-setosa  |
|---------------+-------------+--------------+-------------+--------------|
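For a tab-delimited file, the -t flag mentioned above might be used as in the sketch below. The sample file is a hypothetical three-column excerpt written to a temporary path, and the csvlook call is simply skipped if csvkit is not installed.

```shell
# Write a tiny tab-delimited sample (hypothetical excerpt, not the real iris.csv) to a temp file.
tsv=$(mktemp)
printf 'sepal_length\tsepal_width\tspecies\n5.1\t3.5\tIris-setosa\n4.9\t3.0\tIris-setosa\n' > "$tsv"

# -t tells csvlook that the input is tab-delimited rather than comma-delimited.
# The guard keeps the snippet from failing on machines without csvkit.
if command -v csvlook >/dev/null 2>&1; then
  csvlook -t "$tsv"
fi
```

Without -t, csvlook would assume commas and would not split this file into columns as intended.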
iris.csv is a comma-separated values file. We are going to look at some statistics using datamash. It is a very interesting GNU project, and I like it very much. It is very powerful and enables me to do some very useful things together with awk and sed. The linked page has examples that work with a gene annotation file.
Let's look at the average sepal_length for each species. We can do it easily in R with dplyr, but I am going to use the command line.
cat iris.csv | datamash -t "," -H -s -g 5 mean 1
GroupBy(species),mean(sepal_length)
Iris-setosa,5.006
Iris-versicolor,5.936
Iris-virginica,6.588
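The -t "," flag sets the field delimiter to a comma, -H means there is a header line in iris.csv, -s sorts the file first (grouping needs sorted input), and -g 5 groups the data by the fifth column (species). mean 1 then calculates the mean of the first column (sepal_length) within each group.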
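As a cross-check on what grouping by column 5 and taking the mean of column 1 computes, the same grouped mean can be sketched in plain awk. This is only an illustration and not how datamash works internally; the sample file holds a few rows in the iris.csv format rather than the full dataset, and the group_mean function name is made up for this example.

```shell
# A few sample rows in the iris.csv format (illustration only, not the full dataset).
sample=$(mktemp)
cat > "$sample" <<'EOF'
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
EOF

# group_mean FILE: skip the header, sum column 1 per value of column 5,
# then print "group,mean". awk's array order is unspecified, so sort the output.
group_mean() {
  awk -F, 'NR > 1 { sum[$5] += $1; n[$5]++ }
           END { for (k in sum) printf "%s,%.3f\n", k, sum[k] / n[k] }' "$1" | sort
}

group_mean "$sample"
# prints:
# Iris-setosa,5.000
# Iris-versicolor,6.700
```

Pointing group_mean at the full iris.csv should give the same per-species means as the datamash command, without the header line.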
Another very useful tool that I came across is q, which can execute SQL commands on plain text files. q assumes the file is space-delimited by default. Use -d "," for comma-delimited files and -t for tab-delimited files.
cat iris.csv | q -H -d "," "SELECT AVG(sepal_length), species FROM - GROUP BY species"
5.006,Iris-setosa
5.936,Iris-versicolor
6.588,Iris-virginica
We get the same result as with datamash.
I am going to use Rio to interact with R on the command line and show the figure with the display command, following the approach in the linked post: IBash Notebook
cat iris.csv | Rio -ge "g+geom_point(aes(x=sepal_length,y=sepal_width,colour=species))" | display
We get this figure inline, which I think is awesome!
Other examples can be found in the linked post. Nevertheless, IBash Notebook gives you a way to document your Linux commands in real time and makes your research reproducible to some extent!