STAT 29000: Project 8 — Fall 2021
Motivation:
Context: This is the third and final part in a series of projects that are designed to exercise skills around UNIX utilities, with a focus on writing bash scripts and awk
. You will get the opportunity to manipulate data without leaving the terminal. At first it may seem overwhelming, however, with just a little practice you will be able to accomplish data wrangling tasks really efficiently.
Scope: awk, bash scripts, R, Python
Questions
This is the last project based on bash and awk — the rest are SQL. If you struggled or did not like the bash projects, you are not alone! This is frequently the most intimidating for students. Students tend to really like the SQL projects, so relief is soon to come. |
Question 1
Take some time to explore /depot/datamine/data/taxi/**
, and answer the following questions using UNIX utilities.
-
In which two directories is the bulk of the data (except
fhv
— we don’t care about that data for now)? -
What is the total size in Gb of the data in those two directories?
So for example do all the files in |
Check out the PDFs in the directory to learn more about the dataset. |
-
Code used to solve this problem.
-
Output from running the code.
Question 2
To start, let’s focus on the yellow taxi data. The Total_Amt
column is the total cost of the taxi ride. It is broken down into 4 categories: Fare_Amt
, surcharge
, Tip_Amt
, and Tolls_Amt
.
Write a bash script called question2.sh
that accepts a path to a yellow taxi data file as an argument, and returns a breakdown of the overall percentage each of the 4 categories make up of the total.
fares: 5.0% surcharges: 2.5% tips: 2.5% tolls: 90.0%
To help get you started, here is some skeleton code.
#!/bin/bash
awk -F',' '{
# calculate stuff
fares+=$13;
} END {
# print stuff
}' $1
Make sure your output format matches this example exactly. Every value should be with 1 decimal place followed by a percentage sign. |
It may take a minute to run. You are processing 2.5G of data! |
This link may be useful. |
The result of the following.
Should be: fares: 92.6% surcharges: 1.7% tips: 4.5% tolls: 1.1% |
-
Code used to solve this problem.
-
Output from running the code.
Question 3
Did you know awk
has the ability to process multiple files at once? Pass multiple files to your script from question (2) to test it out.
%%bash
chmod +x ./question3.sh
./question3.sh /depot/datamine/data/taxi/yellow/yellow_tripdata_2009-01.csv /depot/datamine/data/taxi/yellow/yellow_tripdata_2009-02.csv
Now, modify your script from question (2). Return the summary values from question (2), but for each month instead of for the overall data. Use Trip_Pickup_dateTime
to determine the month.
January ---- fares: 5.0% surcharges: 2.5% tips: 2.5% tolls: 90.0% ---- February ---- fares: 5.0% surcharges: 2.5% tips: 2.5% tolls: 90.0% ---- etc..
You may will need to pass more than 1 file to your script in order to get more than 1 month of output. |
To help get you started, you can find some skeleton code below.
#!/bin/bash
awk -F',' 'BEGIN{
months[1] = "January"
months[2] = "February"
months[3] = "March"
months[4] = "April"
months[5] = "May"
months[6] = "June"
months[7] = "July"
months[8] = "August"
months[9] = "September"
months[10] = "October"
months[11] = "November"
months[12] = "December"
} NR > 1 {
# use split to parse out the month
# convert the month to int
month = int();
# sum values by month using awk array
} END {
for (m in total) {
if (m != 0) {
# print stuff
}
}
}' $@
-
Code used to solve this problem.
-
Output from running the code.
Pick 1 of the 2 following questions to answer. If you would like to answer both, your instructors and graders will be wow’d and happy (no pressure)! To be clear, however, you only need to answer 1 of the following 2 questions in order to get full credit. |
Question 4 (Option 1)
There are a lot of interesting questions that you could ask for this dataset. Here are some questions that could be interesting:
-
Does time of day, day of week, or month of year appear to have an effect on tips?
-
Are people indeed more generous (with tips) near Christmas?
-
How many trips are there, by hour of day? What are the rush hours?
-
Do different vendors charge more or less than other vendors?
Either choose a provided question, or write your own. Use your newfound knowledges of UNIX utilities and bash scripts to answer the question. Include the question you want answered, what, if any, hypotheses you have, what the data told you, and what you conclude (anecdotally).
-
Code used to solve this problem.
-
Output from running the code.
Question 4 (Option 2)
Standard UNIX utilities are not the end-all be-all to terminal tools. this repository has a lot of really useful tools that tend to have an opinionated take on a classic UNIX tool.
ripgrep is the poster child of this new generation of tools. It is a text search utility that is empirically superior in the majority of metrics (to grep
). Additionally, it has subjectively better defaults. You can read (in great detail) about ripgrep here.
In addition to those tools, there is xsv from the same developer as ripgrep. xsv
is a utility designed to perform operations on delimited separated value files. Many of the questions that have been asked about in the previous few projects could have been quickly and easily answered using xsv
.
Most of these utilities are available to you in a bash
cell in Jupyter Lab. Choose 2 questions from previous projects and re-answer them using these modern tools. Which did you prefer, and why?
-
Code used to solve this problem.
-
Output from running the code.
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. |