DSAN 6000: Big Data and Cloud Computing
Fall 2025
Thursday, August 28, 2025
Course and syllabus overview
Big Data Concepts
Data Engineering
Introduction to bash
These are also pinned on the Slack main channel
Instructors:
- aa1603@georgetown.edu
- jj1088@georgetown.edu

TAs:
- bc928@georgetown.edu (Lead TA for the course!)
- pp755@georgetown.edu
- au195@georgetown.edu
- yw924@georgetown.edu
- ny159@georgetown.edu
- ly290@georgetown.edu
- xz646@georgetown.edu
Data is everywhere! Often it is simply too big to work with using traditional tools. This is a hands-on, practical, workshop-style course about using cloud computing resources to analyze and manipulate datasets that are too large to fit on a single machine or to be analyzed with traditional tools. The course will focus on Spark, MapReduce, the Hadoop ecosystem, and other tools.
You will learn how to acquire and/or ingest data, and then massage, clean, transform, analyze, and model it within the context of big data analytics. You will be able to think more programmatically and logically about your big data needs, tools, and issues.
Always refer to the syllabus and calendar in the course website for class policies.
dsan-Fall-2025@georgetown.edu
Where does it come from?
How is it being created?
Every 60 seconds in 2025:
We can record every:


Many interesting datasets have a graph structure:
Some of these are HUGE



75 billion connected devices generating data:


The Internet
Transactions
Databases
Excel
PDF Files
Anything digital (music, movies, apps)
Some old floppy disk lying around the house
Scenario 1: Traditional Big Data
You have a laptop with 16GB of RAM and a 256GB SSD. You are given a 1TB dataset in text files. What do you do?
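One common answer is to stream the data instead of loading it: process the file line by line so memory use stays constant regardless of file size. A minimal sketch of the idea, using a tiny stand-in file and hypothetical column names (the same pattern works on a 1TB file):

```shell
# Stand-in for a huge key,value text file (hypothetical data)
printf 'a,1\nb,2\na,3\n' > sample.csv

# Stream through the file, aggregating as we go. awk holds only the
# current line plus small running totals in memory at any moment.
awk -F, '{sum[$1] += $2} END {for (k in sum) print k, sum[k]}' sample.csv | sort
```

Splitting the file into chunks and running this on each chunk in parallel is the intuition behind the MapReduce-style tools covered later in the course.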
Scenario 2: AI/ML Pipeline
Your company wants to build a RAG system using 10TB of internal documents. You need sub-second query response times. How do you architect this?
Scenario 3: Real-Time Analytics
You need to process 1 million events/second from IoT devices and provide real-time dashboards with <1s latency. What’s your stack?
Exponential data growth
Wikipedia
“A collection of datasets so large and complex that it becomes difficult to process using traditional tools and applications. Big Data technologies describe a new generation of technologies and architectures designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis”
O’Reilly
“Big data is when the size of the data itself becomes part of the problem”
IBM (The 3 V’s)
Additional V’s for 2025
\[ \text{``Size''} = f(\text{Processing Ability}, \text{Storage Space}) \]
If any of the answers is no, then you have a big-ish data problem!
Training Foundation Models
Data Requirements Have Exploded
Traditional Use Cases:
Modern AI Use Cases:
Key Components:
Model Context Protocol (MCP)
MCP in Production
Examples:
(Why Data Quality Matters More Than Ever)
Garbage In, Garbage Out - Amplified:
Data Quality Challenges in 2025
Netflix
Uber
OpenAI
Unified Platforms:
Edge Computing + AI:
Synthetic Data:
| | Can be processed on a single machine? Yes | Can be processed on a single machine? No |
|---|---|---|
| Can be stored on a single machine? Yes | Small (Your Laptop) | Medium (Parallel Processing) |
| Can be stored on a single machine? No | Medium (Data Streaming) | Big! (Parallel + Distributed Processing) |
Modern Big Data Stack (2025)
Query Engines:
Data Warehouses & Lakes:
AI/ML Integration:
Orchestration:
| | Small Data is usually… | On the other hand, Big Data… |
|---|---|---|
| Goals | gathered for a specific goal | may have a goal in mind when it’s first started, but things can evolve or take unexpected directions |
| Location | in one place, and often in a single computer file | can be in multiple files in multiple servers on computers in different geographic locations |
| Structure/Contents | highly structured like an Excel spreadsheet, and it’s got rows and columns of data | can be unstructured, it can have many formats in files involved across disciplines, and may link to other resources |
| Preparation | prepared by the end user for their own purposes | is often prepared by one group of people, analyzed by a second group of people, and then used by a third group of people, and they may have different purposes, and they may have different disciplines |
| | Small Data is usually… | On the other hand, Big Data… |
|---|---|---|
| Longevity | kept for a specific amount of time after the project is over because there’s a clear ending point. In the academic world it’s maybe five or seven years and then you can throw it away | contains data that must be stored in perpetuity. Many big data projects extend into the past and future |
| Measurements | measured with a single protocol using set units and it’s usually done at the same time | is collected and measured using many sources, protocols, units, etc |
| Reproducibility | be reproduced in their entirety if something goes wrong in the process | replication is seldom feasible |
| Stakes | if things go wrong the costs are limited, it’s not an enormous problem | can have high costs of failure in terms of money, time and labor |
| Access | identified by a location specified in a row/column | unless it is exceptionally well designed, the organization can be inscrutable |
| Analysis | analyzed together, all at once | is ordinarily analyzed in incremental steps |
| The V | The Challenge |
|---|---|
| Volume | data scale |
| Value | data usefulness in decision making |
| Velocity | data processing: batch or stream |
| Viscosity | data complexity |
| Variability | data flow inconsistency |
| Volatility | data durability |
| Viability | data activeness |
| Validity | data correctness for its intended use |
| Variety | data heterogeneity |
William Cohen (Director, Research Engineering, Google):
R and Python are single-threaded (by default)
Other:
Matt Turck’s Machine Learning, Artificial Intelligence & Data Landscape (MAD)
In this course, you’ll augment your data scientist skills with data engineering skills!





What do we learn from the prompt?
- `username` — the user you are logged in as
- `hostname` — the machine you are connected to
- `current_directory` — where you are in the filesystem
- `$` — this symbol means BA$H is ready for a command

Anatomy of a command:
- `COMMAND` is the program; everything after that = arguments
- `-F` is a single-letter flag; `--FLAG` is a single word (or words connected by dashes)
- A space breaks things into a new argument
- Flags come in short (`-F`) and long (`--FLAG`) forms
- Flags can take values, e.g. `--file file1` passes "file1" as the value for the FILE flag
- The `-h` flag is usually to get help. You can also run the `man` command and pass the name of the program as the argument to get the help page.

Let’s try basic commands:
- `date` to get the current date
- `whoami` to get your user name
- `echo "Hello World"` to print to the console
- `pwd`
- `ls`
- `touch`

Wild cards:
- `*` for wild card of any number of characters
- `?` for wild card of a single character
- `[]` for one of many character options
- `!` for exclusion
- Character classes: `[:alpha:]`, `[:alnum:]`, `[:digit:]`, `[:lower:]`, `[:upper:]`

Navigating directories:
- `pwd` to determine the Present Working Directory
- `cd git-repo` to change into a directory
- `.` refers to the current directory, as in `./git-repo`
- `..` can be used to move up one level (`cd ..`), and can be combined to move up multiple levels (`cd ../../my_folder`)
- `/` is the root of the filesystem: contains core folders (system, users)
- `~` is the home directory. Move to folders referenced relative to this path by including it at the start of your path, for example `~/projects`
- `tree` to display the directory structure

Now that we know how to navigate through directories, we need commands for interacting with files…
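A quick sandbox for trying the wild cards above (hypothetical file names, created just for the demo):

```shell
mkdir -p glob_demo && cd glob_demo
touch file1.txt file2.txt fileA.txt notes.md

ls file?.txt      # ? matches exactly one character: file1.txt file2.txt fileA.txt
ls file[12].txt   # [] matches one of the listed characters: file1.txt file2.txt
ls file[!12].txt  # ! inside [] excludes those characters: fileA.txt
ls *.md           # * matches any run of characters: notes.md
cd ..
```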
- `mv` to move files from one location to another (accepts wild cards: `?`, `*`, `[]`, …)
- `cp` to copy files instead of moving (accepts wild cards: `?`, `*`, `[]`, …)
- `mkdir` to make a directory
- `rm` to remove files
- `rmdir` to remove directories
- `rm -rf` to blast everything! WARNING!!! DO NOT USE UNLESS YOU KNOW WHAT YOU ARE DOING

Viewing and searching files:
- `head FILENAME` / `tail FILENAME` - glimpse the first / last few rows of data
- `more FILENAME` / `less FILENAME` - view the data with basic up / (up & down) controls
- `cat FILENAME` - print entire file contents into the terminal
- `vim FILENAME` - open (or edit!) the file in the vim editor
- `grep PATTERN FILENAME` - search for lines within a file that match a regex pattern
- `wc FILENAME` - count the number of lines (`-l` flag) or number of words (`-w` flag)

Redirection:
- `|` sends the stdout to another command (the most powerful symbol in BASH!)
- `>` sends stdout to a file and overwrites anything that was there before
- `>>` appends the stdout to the end of a file (or starts a new file from scratch if one does not exist yet)
- `<` sends stdin into the command on the left
- `~/.bashrc` is where your shell settings are located
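The redirection operators above compose into small pipelines. A short demo (hypothetical file names):

```shell
# > creates/overwrites a file; >> appends to it
printf 'apple\nbanana\napple\ncherry\n' > fruit.txt
echo 'banana' >> fruit.txt

# | chains commands: count occurrences of each fruit, most frequent first
sort fruit.txt | uniq -c | sort -rn

# < feeds a file to a command's stdin; wc prints just the count, no filename
wc -l < fruit.txt
```

Note that `uniq` only collapses *adjacent* duplicate lines, which is why the `sort` must come first.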
How many processes? whoami | xargs ps -u | wc -l
Hard to remember full command! Let’s make an alias
General syntax:
For our case:
Now we need to put this alias into the .bashrc
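A sketch of the alias workflow, using the general syntax `alias NAME='COMMAND'`. The alias name `nprocs` is hypothetical; the pipeline is the process-count command from above:

```shell
shopt -s expand_aliases   # aliases need this in non-interactive shells

# General syntax: alias NAME='COMMAND'
alias nprocs='whoami | xargs ps -u | wc -l'

nprocs   # now one short word runs the whole pipeline

# To persist it for future sessions, append the definition to ~/.bashrc:
# echo "alias nprocs='whoami | xargs ps -u | wc -l'" >> ~/.bashrc
```

The `echo ... >> ~/.bashrc` line is left commented here so the demo does not modify your settings file; run it once if you want the alias permanently.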
Your commands get saved in ~/.bash_history
- `ps` to see your running processes
- `top` or even better `htop` to see all running processes (install it with `sudo yum install htop -y`)
- `kill [PID NUM]` to “ask” the process to terminate. If things get really bad: `kill -9 [PID NUM]`
- Run `cat` on its own to let it stay open. Now open a new terminal to examine the processes and find the `cat` process.

Bash crawl is a game to help you practice your navigation and file access skills. Click on the binder link in this repo to launch a jupyter lab session and explore!
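The find-and-kill cycle can be practiced without a second terminal by using a background process (here `sleep`, standing in for the open-ended `cat`):

```shell
sleep 300 &        # start a long-running process in the background
PID=$!             # $! holds the PID of the last background process

ps -p "$PID"       # confirm it is running
kill "$PID"        # politely ask it to terminate (sends SIGTERM)
wait "$PID" 2> /dev/null   # reap it; ignore the "Terminated" status noise
ps -p "$PID" || echo "process $PID is gone"
```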
DSAN 6000 Week 1: Course Overview