Our Ironclad Guarantee
You must be satisfied. Try our print books for 30 days or our eBooks for 14 days. If they aren't the best you've ever used, you can return the books or cancel the eBooks for a prompt refund. No questions asked!
These days, data scientists and analysts are in high demand at organizations of all types. This book covers everything you need to hit the ground running using Python for data science and analysis.
To start, it presents a crash course in using the Pandas and Seaborn libraries for data analysis and visualization. Then, it presents a thorough course in data analysis, including how to use the Scikit-learn library to create statistical models that make predictions. Finally, it presents four case studies that tie these skills together and show how they’re used in the real world.
Go to our instructor’s site to learn more about this book and its instructor’s materials.
This book is for anyone who wants to become a data scientist or data analyst. The only prerequisite is some programming experience in Python or any other programming language. That’s because chapter 1 presents the Python skills that you need for this book.
To present the essential Python skills for data science in a manageable progression and at the right pace, this book is divided into four sections.
This section gets you started fast. First, you’ll learn how to use JupyterLab and Notebooks to organize and work with Python code. Then, you’ll learn how to use the Pandas and Seaborn libraries for data analysis and visualization. By the end of this section, you’ll be able to start doing analyses of your own.
This section presents the descriptive analysis skills that are critical for success on the job. That includes how to:
This section presents the predictive analysis skills that you need to create statistical models that make predictions. Although predictive analysis is a large topic that could be an entire book of its own, this section presents the concepts you need to get started with it. More specifically, it shows you how to use the Scikit-learn library to create linear regression models to predict numeric values.
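As a rough illustration of the kind of model this section describes, here is a minimal sketch of a Scikit-learn linear regression that predicts a numeric value. It is not taken from the book: the data is synthetic (generated from y = 3x + 5 plus noise), and the variable names are assumptions chosen for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): y = 3x + 5 plus small noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3 * x + 5 + rng.normal(0, 0.5, size=200)

# Scikit-learn expects the independent variables as a 2-D array
X = x.reshape(-1, 1)

# Hold out a quarter of the rows so the model is validated on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the model on the training rows
model = LinearRegression()
model.fit(X_train, y_train)

# Validate with the R-squared score on the test rows, then make a prediction
r2 = model.score(X_test, y_test)
prediction = model.predict(np.array([[4.0]]))[0]

print(f"slope={model.coef_[0]:.2f}, intercept={model.intercept_:.2f}")
print(f"R-squared on test data: {r2:.3f}")
print(f"predicted y for x=4: {prediction:.2f}")
```

The fitted slope and intercept should land close to the values used to generate the data, which is a quick sanity check that the model is working before you apply the same steps to a real dataset.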
This section presents four complete analyses that show how the skills in this book can be applied to real-world datasets:
These in-depth analyses make sure that you master the professional skills that are in demand today from organizations of all types.
This book has features that are designed to make it as easy as possible for you to learn new skills faster and retain them better. Here are a few of those features:
The only software that’s needed for this book is the Anaconda distribution of Python. It includes JupyterLab, Pandas, Seaborn, Scikit-learn, and more.
Appendixes A and B show how to download and install this distribution on both Windows and macOS systems. Then, chapter 1 shows how to get started with JupyterLab.
“In his first at-bat, Scott McCoy smashes this one out of the park! This book is not just informative, it is exciting.”
—Scott Spurlock, Software Engineer, Georgia
“I really appreciated the four case studies. They…illustrated all phases of data analysis and visualization.”
—J. Jasperson – Texas A&M University
“Unlike some other books on data analysis with Python, the explanations of how to perform data analysis are thorough rather than terse or with no explanations.”
—Posted at an online bookseller
“This is a fun book for beginners and experienced data scientists.”
—Posted at an online bookseller
“This is my first exposure to Murach’s books, and I love them. I like the organization of the content, the consistent approach in each book, and the accuracy of the material.”
—Bob L., Michigan
“I really like the paired-pages format of detailed information on the left and quick notes on the right. This helps me to quickly find the information I’m looking for.”
—Roxanne T., Student, Washington
“I can’t praise this book highly enough. The clarity used in picking what to include, when to introduce it, and how to do so is remarkable.”
—Charles Ferguson, Software Developer, Australia
“Another thing I like is the exercises at the end of each chapter. They’re a great way to reinforce the main points of each chapter and force you to get your hands dirty.”
—Hien Luu, SD Forum/Java SIG
“Throughout the entire project, your book was indispensable to me. The answers were right there at every turn. All the examples made sense, and they all worked!”
—Alan Vogt, ETL Consultant, Massachusetts
“This book covers the perfect amount of description, and it does not make you bored by providing unnecessary details.”
—Posted at an online bookseller
“I picked up my first Murach book at a local bookstore in 2006, not knowing what was inside or what level of knowledge it would require of me, and it has changed my life since, literally. Your format (the paired pages) made it easy for me, an accountant with no IT or software development background, to understand databases and gain skills that proved useful throughout my entire career.”
—Giovanni Galope, Accountant, Philippines
On Murach’s Python Programming: “This is now my third book for Python, and it is the ONLY one that has made me feel comfortable solving problems and reading code. The paired pages approach is fantastic, and it makes learning the syntax, rules, and conventions understandable for me.”
—Posted at an online bookseller
“Your books shine out from the rest—the quality of writing and presentation of information is topnotch, and the consistency of quality across books is impressive.”
—Nolan Tamashiro, Developer
If you haven’t done much Python programming before reading this book, we recommend Murach’s Python Programming. It can help you raise your Python skills to a professional level, and it's an ideal on-the-job reference.
View the table of contents for this book in a PDF.
What data science is
The five phases of data analysis and visualization
The IDEs for Python data science
How to install and import the Python modules for data science
How to call and chain methods
The coding basics for Python data science
How to start JupyterLab and work with a Notebook
How to edit and run the cells in a Notebook
How to use the Tab completion and tooltip features
How syntax and runtime errors work
How to use Markdown language
How to get reference information
How to split the screen between two Notebooks
How to use Magic Commands
The Polling case study
The Forest Fires case study
The Social Survey case study
The Sports Analytics case study
The DataFrame structure
Two ways to get data into a DataFrame
How to save and restore a DataFrame
How to display the data in a DataFrame
How to use the attributes of a DataFrame
How to use the info(), nunique(), and describe() methods
How to access columns
How to access rows
How to access a subset of rows and columns
Another way to access a subset of rows and columns
How to sort the data
How to use the statistical methods
How to use Python for column arithmetic
How to modify the string data in columns
How to use indexes
How to pivot the data
How to melt the data
How to group the data
How to aggregate the data
How to plot the data
The Python libraries for data visualization
Long vs. wide data for data visualization
How the Pandas plot() method works by default
The three basic parameters for the Pandas plot() method
How to create a line plot or an area plot
How to create a scatter plot
How to create a bar plot
How to create a histogram or a density plot
How to create a box plot or a pie plot
How to improve the appearance of a plot
How to work with subplots
How to use chaining to get the plots you want
The Seaborn methods for plotting
The general methods vs. the specific methods
How to use the basic Seaborn parameters
How to use the Seaborn parameters for working with subplots
How to set the title, x label, and y label
How to set the ticks, x limits, and y limits
How to set the background style
How to work with subplots
How to save a plot
How to create a line plot
How to create a scatter plot
How to create a bar plot
How to create a box plot
How to create a histogram
How to create a KDE or ECDF plot
How to enhance a distribution plot
How to use other Axes methods to enhance a plot
How to annotate a plot
How to set the color palette
How to enhance a plot that has subplots
How to customize the titles for subplots
How to set the size of a specific plot
Common data sources
How to find and select the data that you want
How to import data directly into a DataFrame
How to download a file to disk before importing it
How to work with a zip file on disk
How to run queries against a database
How to use a SQL query to import data into a DataFrame
How to get and explore the metadata of a Stata file
How to build DataFrames for the metadata and the data
How to download a JSON file to disk
How to open a JSON file in JupyterLab
How to drill down into the data
How to build a DataFrame for the data
A general plan for cleaning the data
What the info() method can tell you
What the unique values can tell you
What the value counts can tell you
How to drop rows based on conditions
How to drop duplicate rows
How to drop columns
How to rename columns
How to find missing values
How to drop rows with missing values
How to fill missing values
How to find dates and numbers that are imported as objects
How to convert date and time strings to the datetime data type
How to convert object columns to numeric data types
How to work with the category data type
How to replace invalid values and convert a column’s data type
How to fix data problems when you import the data
How to find outliers
How to fix outliers
How to work with datetime columns
How to work with string columns
How to work with numeric columns
How to add a summary column to a DataFrame
How to apply functions to rows or columns
How to apply user-defined functions
How lambda expressions work with DataFrames
How to apply lambda expressions
How to set and remove an index
How to unstack indexed data
How to join DataFrames with an inner join
How to join DataFrames with a left or outer join
How to merge DataFrames
How to concatenate DataFrames
What the warning is telling you
How to handle the warning
How to melt columns to create long data
How to plot melted columns
How to group and apply a single aggregate method
How to work with a DataFrameGroupBy object
How to apply multiple aggregate methods
How to use the pivot() method
How to use the pivot_table() method
How to create bins of equal size
How to create bins with equal numbers of unique values
How to plot binned data
How to select the rows with the largest values
How to calculate the percent change
How to rank rows
How to find other methods for analysis
How to generate time periods
How to reindex with datetime indexes
How to reindex with a semi-month index
How a user-defined function can improve a datetime index
How reindexing with an improved index can improve plots
How to use the resample() method
How to use the label and closed parameters when you downsample
How downsampling can improve plots
The concept of rolling windows
How to create rolling windows
How to plot rolling window data
How to create running totals
How to plot running totals
Types of predictive models
Introduction to regression analysis
The Housing dataset
How to identify correlations with a scatter plot
How to identify correlations with a grid of scatter plots
How to identify correlations with r-values
How to identify correlations with a heatmap
A procedure for creating and using a regression model
The function and methods for linear regression models
How to create, validate, and use a linear regression model
How to plot the predicted data
How to plot the residuals
The lmplot() method and some of its parameters
How to plot a simple linear regression
How to plot a logistic regression
How to plot a polynomial regression
How to plot a lowess regression
How to use the residplot() method to plot the residuals
The Cars dataset
How to create a simple regression model
How to plot the residuals of a simple regression
How to create a multiple regression model
How to plot the residuals of a multiple regression
How to identify categorical variables
How to review categorical variables
How to create dummy variables
How to rescale the data and check the correlations
How to create a multiple regression that includes dummy variables
How to select the independent variables
How to test different combinations of variables
How to use Scikit-learn to select the variables
How to select the right number of variables
Import the modules that you will need
Get the data
Display the data
Examine the data
Drop columns and rows
Rename columns
Fix object columns
Fix data
Take an early plot with Pandas
Save the DataFrame
Add columns for grouping and filtering
Create a new DataFrame in long form
Take an early plot of the long data with Seaborn
Add monthly bins to the DataFrame
Add an average percent column for each month
Save the wide and long DataFrames
Plot the national and swing state polls
Plot the voter types
Plot the last two months of polling
Plot the gap changes in selected states
Prepare the gap data for the last week of polling
Plot the gap data for the last week of polling
Prepare the weekly gap data for the swing states
Plot the weekly gap data for the swing states
Connect and query the database
Import the data into a DataFrame
Examine the data
Improve the readability of the data
Drop unnecessary rows
Drop duplicate rows
Convert dates to datetime objects
Check for missing contain dates
Add fire_month and days_burning columns
Examine the contain_date and days_burning columns
Analyze the data for California
Two more plots for California fires
Rank the states by total acres burned
Prepare a DataFrame for total acres burned by year within state
Prepare a DataFrame for the top 4 states
Plot the acres burned total by year for the top 4 states
Review the 20 largest fires in California
Use GeoPandas to plot the California map
Use GeoPandas or Seaborn to plot the California fires on a map
Plot the fires in the continental United States
Build a DataFrame for the metadata
Use the codebook and read the data that you want
Prepare the data
Plot the data and reduce the number of categories
Plot the total counts of the responses
Convert the counts to percents and plot them
Search the codebook for small question sets
Read and review the work-life data
Plot the responses for the first question
Plot the responses for the second and third questions
Use the codebook to find related columns
Use the codebook to find follow-up questions
Select the columns for an expanded DataFrame
Bin the data for a column
Develop and test a first hypothesis
Develop and test a second hypothesis
Develop and test a third hypothesis
Get the data
Build the DataFrame
Locate and drop unneeded rows
Locate and drop unneeded columns
Convert the game_date column to datetime data
Add a column for the season
Add a column for the shot result
Add a column for points made for each shot
Add three summary columns
Plot the points per game by season
Plot the averages of shots, shots made, and points per game by season
Plot the shot locations for two games
Plot the shot locations for two seasons
Plot the shot density for one season
Plot the shot density for two seasons
How to download the files for this book
How to install Anaconda
How to use the Anaconda Navigator
How to create the murach environment
How to unzip some data and test your setup
How to use the Anaconda Prompt
How to download the files for this book
How to install Anaconda
How to use the Anaconda Navigator
How to create the murach environment
How to unzip some data and test your setup
How to use Terminal with an environment
This zip file includes the data files for the book as well as the JupyterLab Notebooks for:
See for yourself how this book can get you started fast using Python for data science.
This appendix shows how to set up your system so you’re ready to use Python for data science, including instructions for installing the required software and downloading the files for the book examples and exercises.
Windows PDF
macOS PDF
This chapter shows how to use the Pandas library to get the data for an analysis into a DataFrame and to clean, prepare, analyze, and visualize that data. These are the essential Pandas skills that you’ll use for almost every analysis.
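To give a feel for that workflow, here is a minimal sketch that gets data into a DataFrame, cleans it, prepares it, and runs a simple analysis. It is not an excerpt from the chapter: the tiny inline dataset and the column names (`state`, `date`, `acres`) are hypothetical, and the plotting step is left out so the example runs without a display.

```python
import pandas as pd

# Get the data: a small inline dataset (hypothetical values) with typical
# problems: a duplicate row, a missing state, and a non-numeric entry
raw = pd.DataFrame({
    "state": ["CA", "CA", "TX", "TX", "TX", None],
    "date": ["2020-01-05", "2020-02-10", "2020-01-20",
             "2020-01-20", "2020-03-15", "2020-03-01"],
    "acres": ["100", "250", "80", "80", "n/a", "40"],
})

# Clean and prepare the data by chaining methods
clean = (
    raw.drop_duplicates()              # drop the repeated TX row
       .dropna(subset=["state"])       # drop the row with no state
       .assign(
           # convert strings to datetime and numeric data types;
           # values that can't be parsed as numbers become NaN
           date=lambda df: pd.to_datetime(df["date"]),
           acres=lambda df: pd.to_numeric(df["acres"], errors="coerce"),
       )
       .dropna(subset=["acres"])       # drop the row where acres was "n/a"
)

# Analyze the data: total acres by state
totals = clean.groupby("state")["acres"].sum()
print(totals)
```

In a Notebook, a call such as `totals.plot.bar()` could follow as the visualization step.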
Chapter 2 PDF
On this page, we’ll be posting answers to the questions that come up most often about our Python Data Science book. So if you have any questions that you haven’t found answered here, please email us. Thanks!
There are no book corrections that we know of at this time. But if you find any, please email us, and we’ll post any corrections that affect the technical accuracy of the book here. Thank you!
For orders and customer service:
1-800-221-5528
Weekdays, 8 to 4 Pacific Time
If you're a college instructor who would like to consider a book for a course, please visit our website for instructors to learn how to get a complimentary review copy and the full set of instructional materials.