Felix Zumstein
Python for Excel: A Modern Environment for Automation and Data Analysis

“This book explains how you can integrate Python into Excel and free yourself from the inevitable disaster of huge workbooks, thousands of formulas, and ugly VBA hacks. Python for Excel is probably the single most useful book on Excel that I have read and an absolute must-read for any advanced Excel user.”
—Andreas F. Clenow, CIO, Acies Asset Management, and author of the international best-sellers Following the Trend, Stocks on the Move, and Trading Evolved

ISBN: 978-1-492-08100-5

While Excel remains ubiquitous in the business world, recent Microsoft feedback forums are full of requests to include Python as an Excel scripting language. In fact, it’s the top feature requested. What makes this combination so compelling? In this hands-on guide, Felix Zumstein—creator of xlwings, a popular open source package for automating Excel with Python—shows experienced Excel users how to integrate these two worlds efficiently. Excel has added quite a few new capabilities over the past couple of years, but its automation language, VBA, stopped evolving a long time ago. Many Excel power users have already adopted Python for daily automation tasks.
This guide gets you started.

• Use Python without extensive programming knowledge
• Get started with modern tools, including Jupyter notebooks and Visual Studio Code
• Use pandas to acquire, clean, and analyze data and replace typical Excel calculations
• Automate tedious tasks like consolidation of Excel workbooks and production of Excel reports
• Use xlwings to build interactive Excel tools that use Python as a calculation engine
• Connect Excel to databases and CSV files and fetch data from the internet using Python code
• Use Python as a single tool to replace VBA, Power Query, and Power Pivot

Felix Zumstein is creator and maintainer of xlwings, a popular open source package that allows the automation of Excel with Python on Windows and macOS. As CEO of xltrail, a version control system for Excel files, he’s gained deep insight into the typical use cases and issues with Excel across various industries.

Praise for Python for Excel

What can Python do for Excel? If you’ve ever dealt with unexpected workbook crashes, broken calculations, and tedious manual processes, you’ll want to find out. Python for Excel is a comprehensive and succinct overview to getting started with Python as a spreadsheet user, and building powerful data products using both. Don’t let the fear of learning to code keep you away: Felix provides an exceptional foundation for learning Python that even experienced programmers could benefit from. Moreover, he frames this information in a way that is quickly accessible and applicable to you as an Excel user. You can quickly tell reading this book that it was written by someone with years of experience teaching and working with clients on how to use Excel to its fullest extent with the help of Python programming.
Felix is uniquely suited to show you the possibilities of learning Python for Excel; I hope you enjoy the master class as much as I did.
—George Mount, Founder, Stringfest Analytics

Python is the natural progression from Excel and it’s tempting to simply discard Excel altogether. Tempting, but hardly realistic. Excel is here, and here to stay, both in the corporate world and as a useful desktop tool at home and in the office. This book provides the much needed bridge between these two worlds. It explains how you can integrate Python into Excel and free yourself from the inevitable disaster of huge workbooks, thousands of formulas, and ugly VBA hacks. Python for Excel is probably the single most useful book on Excel that I have read and an absolute must-read for any advanced Excel user. A highly recommended book!
—Andreas F. Clenow, CIO, Acies Asset Management, and author of the international best-sellers Following the Trend, Stocks on the Move, and Trading Evolved

Excel remains a cornerstone tool of the financial world, but a vast amount of these Excel applications are an irresponsible mess. This book does an excellent job of showing you how to build better, more robust applications with the help of xlwings.
—Werner Brönnimann, Derivatives and DeFi practitioner and cofounder, Ubinetic AG

Excel and Python are two of the most important tools in the Business Analytics toolbox, and together they are far greater than the sum of their parts. In this book, Felix Zumstein lays out his unparalleled mastery of the many ways to connect Python and Excel using open source, cross-platform solutions.
It will be an invaluable tool for business analysts and data scientists alike, and any Python user looking to harness the power of Excel in their code.
—Daniel Guetta, Associate Professor of Professional Practice and Director of the Business Analytics Initiative at Columbia Business School and coauthor of Python for MBAs

Python for Excel: A Modern Environment for Automation and Data Analysis
by Felix Zumstein

Copyright © 2021 Zoomer Analytics LLC. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Michelle Smith
Development Editor: Melissa Potter
Production Editor: Daniel Elfanbaum
Copyeditor: Piper Editorial Consulting, LLC
Proofreader: nSight Inc.
Indexer: nSight Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

March 2021: First Edition
Revision History for the First Edition: 2021-03-04: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781492081005 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc.
Python for Excel, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc. The views expressed in this work are those of the author, and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

Table of Contents

Preface . . . xi

Part I. Introduction to Python

1. Why Python for Excel? . . . 3
   Excel Is a Programming Language 4
   Excel in the News 5
   Programming Best Practices 6
   Modern

…code themselves. This makes it easier to spot errors and maintain the code going forward. That’s why one line in The Zen of Python is “readability counts.” The Zen of Python is a concise summary of Python’s core design principles, and we will learn how to print it in the next chapter. Let’s have a look at the following code snippet in VBA:

If i …

…VBA has fallen behind, as I will point out in the next section.

Modern Language Features

Since Excel 97, the VBA language hasn’t had any major changes in terms of language features.
That, however, doesn’t mean that VBA isn’t supported anymore: Microsoft is shipping updates with every new release of Excel to be able to automate the new Excel features introduced with that release. For example, Excel 2016 added support to automate Power Query. A language that stopped evolving more than twenty years ago is missing out on modern language concepts that were introduced in all major programming languages over the years. As an example, error handling in VBA is really showing its age. If you’d like to handle an error gracefully in VBA, it goes something like this:

Sub PrintReciprocal(number As Variant)
    ' There will be an error if the number is 0 or a string
    On Error GoTo ErrorHandler
    result = 1 / number
    On Error GoTo 0
    Debug.Print "There was no error!"
Finally:
    ' Runs whether or not an error occurs
    If result = "" Then
        result = "N/A"
    End If
    Debug.Print "The reciprocal is: " & result
    Exit Sub
ErrorHandler:
    ' Runs only in case of an error
    Debug.Print "There was an error: " & Err.Description
    Resume Finally
End Sub

VBA error handling involves the use of labels like Finally and ErrorHandler in the example. You instruct the code to jump to these labels via the GoTo or Resume statements. Early on, labels were recognized to be responsible for what many programmers would call spaghetti code: a nice way of saying that the flow of the code is hard to follow and therefore difficult to maintain. That’s why pretty much all of the actively developed languages have introduced the try/catch mechanism—in Python called try/except—that I will introduce in Chapter 11. If you are a proficient VBA developer, you might also enjoy the fact that Python supports class inheritance, a feature of object-oriented programming that is missing in VBA.

Besides modern language features, there’s another requirement for a modern programming language: cross-platform compatibility.
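For comparison with the VBA PrintReciprocal listing above, here is a minimal sketch of how the same logic might look with Python's try/except, which Chapter 11 introduces properly. The function name and messages are my own illustration, not a listing from the book:

```python
def print_reciprocal(number):
    """Python counterpart of the VBA PrintReciprocal example."""
    try:
        result = 1 / number
    except (ZeroDivisionError, TypeError) as e:
        # Runs only in case of an error
        print(f"There was an error: {e}")
        result = "N/A"
    else:
        # Runs only if no error occurred
        print("There was no error!")
    finally:
        # Runs whether or not an error occurred
        print(f"The reciprocal is: {result}")

print_reciprocal(10)  # There was no error! / The reciprocal is: 0.1
print_reciprocal(0)   # There was an error: ... / The reciprocal is: N/A
```

Note how the control flow reads strictly top to bottom, with no labels and no GoTo or Resume jumps.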
Let’s see why this is important!

Cross-Platform Compatibility

Even if you develop your code on a local computer that runs on Windows or macOS, it’s very likely that you want to run your program on a server or in the cloud at some point. Servers allow your code to be executed on a schedule and make your application accessible from everywhere you want, with the computing power you need. In fact, I will show you how to run Python code on a server in the next chapter by introducing you to hosted Jupyter notebooks. The vast majority of servers run on Linux, as it is a stable, secure, and cost-effective operating system. And since Python programs run unchanged on all major operating systems, this will take out much of the pain when you transition from your local machine to a production setup.

In contrast, even though Excel VBA runs on Windows and macOS, it’s easy to introduce functionality that only runs on Windows. In the official VBA documentation or on forums, you will often see code like this:

Set fso = CreateObject("Scripting.FileSystemObject")

Whenever you have a CreateObject call or are being told to go to Tools > References in the VBA editor to add a reference, you are almost always dealing with code that will only run on Windows. Another prominent area where you need to watch out if you want your Excel files to work across Windows and macOS are ActiveX controls. ActiveX controls are elements like buttons and dropdowns that you can place on your sheets, but they work only on Windows. Make sure to avoid them if you want your workbook to run on macOS too!

Conclusion

In this chapter, we met Python and Excel, two very popular technologies that have been around for multiple decades—a long time compared to many other technologies that we use today. The London Whale served as an example of how much can go wrong (in dollar terms) when you don’t use Excel properly with mission-critical workbooks.
This was our motivation to look into a minimal set of programming best practices: applying separation of concerns, following the DRY principle, and making use of automated testing and version control. We then had a look at Power Query and Power Pivot, Microsoft’s approach at dealing with data that is bigger than your spreadsheet. I, however, feel that they are often not the right solution, as they lock you into the Microsoft world and prevent you from taking advantage of the flexibility and power of modern cloud-based solutions.

Python comes with convincing features that are missing in Excel: the standard library, the package manager, libraries for scientific computing, and cross-platform compatibility. By learning how to combine Excel with Python, you can have the best of both worlds and will save time through automation, commit fewer errors as it’s easier to follow programming best practices, and you will be able to take your application and scale it up outside of Excel if you ever need to.

Now that you know why Python is such a powerful companion for Excel, it’s time to set up your development environment to be able to write your first lines of Python code!

CHAPTER 2
Development Environment

You probably can’t wait to learn the basics of Python but before we get there, you first need to set up your computer accordingly. To write VBA code or Power Queries, it’s enough to fire up Excel and open the VBA or Power Query editor, respectively. With Python, it’s a bit more work.

We will start this chapter by installing the Anaconda Python distribution. Besides installing Python, Anaconda will also give us access to the Anaconda Prompt and Jupyter notebooks, two essential tools that we will use throughout this book. The Anaconda Prompt is a special Command Prompt (Windows) or Terminal (macOS); it allows us to run Python scripts and other command line tools that we will meet in this book.
Jupyter notebooks allow us to work with data, code, and charts in an interactive way, which makes them a serious competitor to Excel workbooks. After playing around with Jupyter notebooks, we will install Visual Studio Code (VS Code), a powerful text editor. VS Code works great for writing, running, and debugging Python scripts and comes with an integrated Terminal. Figure 2-1 summarizes what’s included in Anaconda and VS Code.

As this book is about Excel, I am focusing on Windows and macOS in this chapter. However, everything up to and including Part III runs on Linux as well. Let’s get started by installing Anaconda!

1. 32-bit systems only exist with Windows and have become rare. An easy way to find out which Windows version you have is by going to the C:\ drive in the File Explorer. If you can see both the Program Files and Program Files (x86) folders, you are on a 64-bit version of Windows. If you can only see the Program Files folder, you are on a 32-bit system.

Figure 2-1. Development environment

The Anaconda Python Distribution

Anaconda is arguably the most popular Python distribution used for data science and comes with hundreds of third-party packages preinstalled: it includes Jupyter notebooks and most of the other packages that this book will use extensively, including pandas, OpenPyXL, and xlwings. The Anaconda Individual Edition is free for private use and guarantees that all the included packages are compatible with each other. It installs into a single folder and can easily be uninstalled again. After installing it, we will learn a few basic commands on the Anaconda Prompt and run an interactive Python session. We’ll then meet the package managers Conda and pip before wrapping this section up with Conda environments. Let’s get started by downloading and installing Anaconda!

Installation

Go to the Anaconda home page and download the latest version of the Anaconda installer (Individual Edition).
Make sure to download the 64-bit graphical installer for the Python 3.x version.1 Once downloaded, double-click the installer to start the installation process and make sure to accept all the defaults. For more detailed installation instructions, follow the official documentation.

Other Python Distributions

While the instructions in this book assume that you have the Anaconda Individual Edition installed, the code and concepts shown will work with any other Python installation, too. In this case, however, you will have to install the required dependencies by following the instructions included in requirements.txt in the companion repository.

With Anaconda installed, we can now start using the Anaconda Prompt. Let’s see what this is and how it works!

Anaconda Prompt

The Anaconda Prompt is really just a Command Prompt on Windows and a Terminal on macOS that have been set up to run with the correct Python interpreter and third-party packages. The Anaconda Prompt is the most basic tool to run Python code, and we will make extensive use of it in this book to run Python scripts and all sorts of command line tools that are offered by various packages.

Anaconda Prompt without Anaconda

If you don’t use the Anaconda Python distribution, you will have to use the Command Prompt on Windows and the Terminal on macOS whenever I instruct you to use the Anaconda Prompt.

If you have never used a Command Prompt on Windows or a Terminal on macOS, don’t worry: you only need to know a handful of commands that will already give you a lot of power. Once you get used to it, using the Anaconda Prompt is often faster and more convenient than clicking your way through graphical user menus. Let’s get started:

Windows
Click on the Start menu button and start typing Anaconda Prompt. In the appearing entries, choose Anaconda Prompt, not Anaconda Powershell Prompt.
Either select it with the arrow keys and hit Enter or use your mouse to click on it. If you prefer to open it via the Start menu, you will find it under Anaconda3. It is a good idea to pin the Anaconda Prompt to your Windows taskbar as you will use it regularly throughout this book. The input line of the Anaconda Prompt will start with (base):

(base) C:\Users\felix>

macOS
On macOS, you won’t find an application called Anaconda Prompt. Instead, by Anaconda Prompt, I am referring to the Terminal that has been set up by the Anaconda installer to automatically activate a Conda environment (I will say more about Conda environments in a moment): press Command-Space bar or open the Launchpad, then type in Terminal and hit Enter. Alternatively, open the Finder and navigate to Applications > Utilities, where you will find the Terminal app that you can double-click. Once the Terminal appears, it should look something like this, i.e., the input line has to start with (base):

(base) felix@MacBook-Pro ~ %

If you are on an older version of macOS, it looks rather like this:

(base) MacBook-Pro:~ felix$

Unlike the Command Prompt on Windows, the Terminal on macOS doesn’t show the full path of the current directory. Instead, the tilde stands for the home directory, which is usually /Users/. To see the full path of your current directory, type pwd followed by Enter. pwd stands for print working directory.

If the input line in your Terminal doesn’t start with (base) after the installation of Anaconda, here is a common reason: if you had the Terminal running during the Anaconda installation, you will need to restart it. Note that clicking on the red cross on the top left of the Terminal window will only hide it but not quit it. Instead, right-click on the Terminal in the dock and select Quit or hit Command-Q while the Terminal is your active window. When you start it again and the Terminal shows (base) at the beginning of a new line, you are all set.
It’s a good idea to keep the Terminal in your dock, as you will use it regularly throughout this book.

Having the Anaconda Prompt up and running, try out the commands outlined in Table 2-1. I am explaining each command in more detail after the table.

Table 2-1. Commands for the Anaconda Prompt

Command                            Windows            macOS
List files in current directory    dir                ls -la
Change directory (relative)        cd path\to\dir     cd path/to/dir
Change directory (absolute)        cd C:\path\to\dir  cd /path/to/dir
Change to D drive                  D:                 (doesn’t exist)
Change to parent directory         cd ..              cd ..
Scroll through previous commands   ↑ (up-arrow)       ↑ (up-arrow)

List files in current directory
On Windows, type in dir for directory, then hit Enter. This will print the content of the directory you are currently in. On macOS, type in ls -la followed by Enter. ls is short for list directory contents, and -la will print the output in the long listing format and include all files, including hidden ones.

Change directory
Type cd Down and hit the Tab key. cd stands for change directory. If you are in your home folder, the Anaconda Prompt should most likely be able to autocomplete it to cd Downloads. If you are in a different folder or don’t have a folder called Downloads, simply start to type the beginning of one of the directory names you saw with the previous command (dir or ls -la) before hitting the Tab key to autocomplete. Then hit Enter to change into the autocompleted directory. If you are on Windows and need to change your drive, you first need to type in the drive name before you can change into the correct directory:

C:\Users\felix> D:
D:\> cd data
D:\data>

Note that by starting your path with a directory or file name that is within your current directory, you are using a relative path, e.g., cd Downloads.
If you would like to go outside of your current directory, you can type in an absolute path, e.g., cd C:\Users on Windows or cd /Users on macOS (mind the forward slash at the beginning).

Change to parent directory
To go to your parent directory, i.e., one level up in your directory hierarchy, type cd .. followed by Enter (make sure that there is a space between cd and the dots). You can combine this with a directory name, for example, if you want to go up one level and then change to the Desktop, enter cd ..\Desktop. On macOS, replace the backslash with a forward slash.

Scroll through previous commands
Use the up-arrow key to scroll through the previous commands. This will save you many keystrokes if you need to run the same commands over and over again. If you scroll too far, use the down-arrow key to scroll back.

File Extensions

Unfortunately, Windows and macOS hide file extensions by default in the Windows Explorer or macOS Finder, respectively. This can make it harder to work with Python scripts and the Anaconda Prompt, as they will require you to refer to files including their extensions. When working with Excel, showing file extensions also helps you understand whether you’re dealing with the default xlsx file, a macro-enabled xlsm file, or any of the other Excel file formats. Here is how you make the file extensions visible:

Windows
Open a File Explorer and click on the View tab. Under the Show/Hide group, activate the “File name extensions” checkbox.

macOS
Open the Finder and go to Preferences by hitting Command-, (Command-comma). On the Advanced tab, check the box next to “Show all filename extensions.”

And that’s already it! You are now able to fire up the Anaconda Prompt and run commands in the desired directory.
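As an aside, once you are inside a Python session you can inspect and change the working directory without leaving the interpreter. A small, hypothetical sketch using only the standard library (not a listing from this chapter):

```python
import os
from pathlib import Path

# Path.cwd() is the Python-side equivalent of pwd: it shows the
# directory that relative paths will be resolved against.
print(Path.cwd())

# os.chdir() is the equivalent of the cd command. Path.home() points
# to the home directory on Windows, macOS, and Linux alike.
os.chdir(Path.home())
print(Path.cwd())  # now prints your home directory
```

This comes in handy when a script needs to read files relative to a known location rather than wherever the prompt happens to be.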
You’ll be using this right away in the next section, where I’ll show you how to start an interactive Python session.

Python REPL: An Interactive Python Session

You can start an interactive Python session by running the python command on an Anaconda Prompt:

(base) C:\Users\felix>python
Python 3.8.5 (default, Sep 3 2020, 21:29:08) [...] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>>

The text that gets printed in a Terminal on macOS will slightly differ, but otherwise, it works the same. This book is based on Python 3.8—if you would like to use a newer version of Python, make sure to consult the book’s home page (https://xlwings.org/book) for instructions.

Anaconda Prompt Notation

Going forward, I will start lines of code with (base)> to denote that they are typed into an Anaconda Prompt. For example, to launch an interactive Python interpreter, I will write:

(base)> python

which on Windows will look similar to this:

(base) C:\Users\felix> python

and on macOS similar to this (remember, on macOS, the Terminal is your Anaconda Prompt):

(base) felix@MacBook-Pro ~ % python

Let’s play around a bit! Note that >>> in an interactive session means that Python expects your input; you don’t have to type this in. Follow along by typing in each line that starts with >>> and confirm with the Enter key:

>>> 3 + 4
7
>>> "python " * 3
'python python python '

This interactive Python session is also referred to as Python REPL, which stands for read-eval-print loop: Python reads your input, evaluates it, and prints the result instantly while waiting for your next input. Remember the Zen of Python that I mentioned in the previous chapter? You can now read the full version to get some insight into the guiding principles of Python (smile included). Simply run this line by hitting Enter after typing it in:

>>> import this

To exit out of your Python session, type quit() followed by the Enter key.
Alternatively, hit Ctrl+Z on Windows, then hit the Enter key. On macOS, simply hit Ctrl-D—no need to press Enter.

Having exited the Python REPL, it’s a good moment to play around with Conda and pip, the package managers that come with the Anaconda installation.

Package Managers: Conda and pip

I already said a few words about pip, Python’s package manager, in the previous chapter: pip takes care of downloading, installing, updating, and uninstalling Python packages as well as their dependencies and subdependencies. While Anaconda works with pip, it has a built-in alternative package manager called Conda. One advantage of Conda is that it can install more than just Python packages, including additional versions of the Python interpreter. As a short recap: packages add additional functionality to your Python installation that is not covered by the standard library. pandas, which I will properly introduce in Chapter 5, is an example of such a package. Since it comes preinstalled in Anaconda’s Python installation, you don’t have to install it manually.

Conda vs. pip

With Anaconda, you should install everything you can via Conda and only use pip to install those packages that Conda can’t find. Otherwise, Conda may overwrite files that were previously installed with pip.

Table 2-2 gives you an overview of the commands that you will use most often. These commands have to be typed into an Anaconda Prompt and will allow you to install, update, and uninstall your third-party packages.

Table 2-2.
Conda and pip commands

Action                              Conda                        pip
List all installed packages         conda list                   pip freeze
Install the latest package version  conda install package        pip install package
Install a specific package version  conda install package=1.0.0  pip install package==1.0.0
Update a package                    conda update package         pip install --upgrade package
Uninstall a package                 conda remove package         pip uninstall package

For example, to see what packages are already available in your Anaconda distribution, type:

(base)> conda list

Whenever this book requires a package that is not included in the Anaconda installation, I will point this out explicitly and show you how to install it. However, it may be a good idea to take care of installing the missing packages now so that you won’t need to deal with it later on. Let’s first install plotly and xlutils, the packages that are available via Conda:

(base)> conda install plotly xlutils

After running this command, Conda will show you what it’s going to do and requires you to confirm by typing y and hitting Enter. Once done, you can install pyxlsb and pytrends with pip, as these packages are not available via Conda:

(base)> pip install pyxlsb pytrends

Unlike Conda, pip will install the packages right away when you hit Enter without the need to confirm.

Package Versions

Many Python packages are updated often and sometimes introduce changes that aren’t backward compatible. This will likely break some of the examples in this book. I will try to keep up with these changes and post fixes on the book’s home page, but you could also create a Conda environment that uses the same versions of the packages that I was using when writing this book. I will introduce Conda environments in the next section, and you will find detailed instructions on how to create a Conda environment with the specific packages in Appendix A.

You know now how to use the Anaconda Prompt to start a Python interpreter and install additional packages.
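You can also check which version of a package is installed from within Python itself, which is handy when verifying that your setup matches the versions used in a project. A small sketch using the standard library's importlib.metadata (available from Python 3.8 onward; the package names below are just examples):

```python
from importlib.metadata import version, PackageNotFoundError

# version() returns the installed version string, e.g., "21.0.1";
# it raises PackageNotFoundError if the package is not installed
# in the active environment.
for package in ["pip", "no-such-package"]:
    try:
        print(package, version(package))
    except PackageNotFoundError:
        print(package, "is not installed")
```

This is the Python-side counterpart of conda list or pip freeze, restricted to the packages you care about.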
In the next section, I’ll explain what (base) at the beginning of your Anaconda Prompt means.

Conda Environments

You may have been wondering why the Anaconda Prompt shows (base) at the beginning of each input line. It’s the name of the active Conda environment. A Conda environment is a separate “Python world” with a specific version of Python and a set of installed packages with specific versions. Why is this necessary? When you start to work on different projects in parallel, they will have different requirements: one project may use Python 3.8 with pandas 0.25.0, while another project may use Python 3.9 with pandas 1.0.0. Code that is written for pandas 0.25.0 will often require changes to run with pandas 1.0.0, so you can’t just upgrade your Python and pandas versions without making changes to your code. Using a Conda environment for each project makes sure that every project runs with the correct dependencies. While Conda environments are specific to the Anaconda distribution, the concept exists with every Python installation under the name virtual environment. Conda environments are more powerful because they make it easier to deal with different versions of Python itself, not just packages.

While you work through this book, you will not have to change your Conda environment, as we’ll always be using the default base environment. However, when you start building real projects, it’s good practice to use one Conda or virtual environment for each project to avoid any potential conflicts between their dependencies. Everything you need to know about dealing with multiple Conda environments is explained in Appendix A. There you will also find instructions on creating a Conda environment with the exact versions of the packages that I used to write this book. This will allow you to run the examples in this book as-is for many years to come.
The other option is to watch the book’s home page for potential changes required for newer versions of Python and the packages.

Having resolved the mystery around Conda environments, it’s time to introduce the next tool, one that we will use intensely in this book: Jupyter notebooks!

Jupyter Notebooks

In the previous section, I showed you how to start an interactive Python session from an Anaconda Prompt. This is useful if you want a bare-bones environment to test out something simple. For the majority of your work, however, you want an environment that is easier to use. For example, going back to previous commands and displaying charts is hard with a Python REPL running in an Anaconda Prompt. Fortunately, Anaconda comes with much more than just the Python interpreter: it also includes Jupyter notebooks, which have emerged as one of the most popular ways to run Python code in a data science context. Jupyter notebooks allow you to tell a story by combining executable Python code with formatted text, pictures, and charts into an interactive notebook that runs in your browser. They are beginner-friendly and thus especially useful for the first steps of your Python journey. They are, however, also hugely popular for teaching, prototyping, and researching, as they facilitate reproducible research.

Jupyter notebooks have become a serious competitor to Excel as they cover roughly the same use case as a workbook: you can quickly prepare, analyze, and visualize data. The difference to Excel is that all of it happens by writing Python code instead of clicking around in Excel with your mouse. Another advantage is that Jupyter notebooks don’t mix data and business logic: the Jupyter notebook holds your code and charts, whereas you typically consume data from an external CSV file or a database. Having Python code visible in your notebook makes it easy to see what’s going on compared to Excel, where the formulas are hidden away behind a cell’s value.
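To make the separation of code and data concrete, here is a minimal, hypothetical sketch using only the standard library; the CSV content is held in memory as a stand-in for an external file, and Chapter 5 introduces pandas, which makes this kind of work far more convenient:

```python
import csv
import io

# Stand-in for an external data file such as a sales.csv; in a real
# notebook you would open a file path instead of a StringIO object.
csv_file = io.StringIO("product,units\nSpreadsheet,12\nNotebook,34\n")

# The logic stays visible in the code, the data stays outside of it.
rows = list(csv.DictReader(csv_file))
total_units = sum(int(row["units"]) for row in rows)
print(total_units)  # 46
```

Unlike a workbook, the calculation here can be read top to bottom; there is no hidden formula behind a cell.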
Jupyternotebooks are also easy to run both locally and on a remote server. Servers usuallyhave more power than your local machine and can run your code fully unattended,something that is hard to do with Excel.In this section, I’ll show you the very basics of how you run and navigate a Jupyternotebook: we will learn about notebook cells and see what the difference is betweenthe edit and command mode. We’ll then understand why the run order of cells mat‐ters before we wrap this section up by learning how to properly shut down note‐books. Let’s get started with our first notebook!Running Jupyter NotebooksOn your Anaconda Prompt, change to the directory of your companion repository,then launch a Jupyter notebook server:(base)> cd C:\Users\username\python-for-excel(base)> jupyter notebookThis will automatically open your browser and show the Jupyter dashboard with thefiles in the directory from where you were running the command. On the top right ofthe Jupyter dashboard, click on New, then select Python 3 from the dropdown list(see Figure 2-2).28 | Chapter 2: Development EnvironmentFigure 2-2. The Jupyter dashboardThis will open a new browser tab with your first empty Jupyter notebook as shown inFigure 2-3.Figure 2-3. An empty Jupyter notebookIt’s a good habit to click on Untitled1 next to the Jupyter logo to rename your work‐book into something more meaningful, e.g., first_notebook. The lower part ofFigure 2-3 shows a notebook cell—move on to the next section to learn more aboutthem!Notebook CellsIn Figure 2-3, you see an empty cell with a blinking cursor. If the cursor doesn’t blink,click into the cell with your mouse, i.e., to the right of In [ ]. Now repeat the exer‐cise from the last section: type in 3 + 4 and run the cell by either clicking on the Runbutton in the menu bar at the top or—much easier—by hitting Shift+Enter. This willrun the code in the cell, print the result below the cell and jump to the next cell. 
In this case, it inserts an empty cell below as we only have one cell so far. Going into a bit more detail: while a cell is calculating, it shows In [*] and when it's done, the asterisk turns into a number, e.g., In [1]. Below the cell you will have the corresponding output labeled with the same number: Out [1]. Every time you run a cell, the counter increases by one, which helps you to see in which order the cells were executed. Going forward, I will show the code samples in this format, e.g., the REPL example from before looks like this:

    In [1]: 3 + 4
    Out[1]: 7

This notation allows you to follow along easily by typing 3 + 4 into a notebook cell. When running it by hitting Shift+Enter, you will get what I show as output under Out[1]. If you read this book in an electronic format supporting colors, you will notice that the input cell formats strings, numbers, and so on with different colors to make it easier to read. This is called syntax highlighting.

Cell Output

If the last line in a cell returns a value, it is automatically printed by the Jupyter notebook under Out [ ]. However, when you use the print function or when you get an exception, it is printed directly below the In cell without the Out [ ] label. The code samples in this book are formatted to reflect this behavior.

Cells can have different types, two of which are of interest to us:

Code
    This is the default type. Use it whenever you want to run Python code.

Markdown
    Markdown is a syntax that uses standard text characters for formatting and can be used to include nicely formatted explanations and instructions in your notebook.

To change a cell's type to Markdown, select the cell, then choose Markdown in the cell mode dropdown (see Figure 2-3). I'll show you a keyboard shortcut to change the cell mode in Table 2-3.
After changing an empty cell into a Markdown cell, type in the following text, which explains a few Markdown rules:

    # This is a first-level heading
    ## This is a second-level heading
    You can make your text *italic* or **bold** or `monospaced`.
    * This is a bullet point
    * This is another bullet point

After hitting Shift+Enter, the text will be rendered into nicely formatted HTML. At this point, your notebook should look like what's in Figure 2-4. Markdown cells also allow you to include images, videos, or formulas; see the Jupyter notebook docs (https://oreil.ly/elGTF).

Figure 2-4. The notebook after running a code cell and a Markdown cell

Now that you know about the code and Markdown cell types, it's time to learn an easier way to navigate between cells: the next section introduces the edit and command mode along with a few keyboard shortcuts.

Edit vs. Command Mode

When you interact with cells in a Jupyter notebook, you are either in the edit mode or in the command mode:

Edit mode
    Clicking into a cell starts the edit mode: the border around the selected cell turns green, and the cursor in the cell is blinking. Instead of clicking into a cell, you can also hit Enter when the cell is selected.

Command mode
    To switch into command mode, hit the Escape key; the border around the selected cell will be blue, and there won't be any blinking cursor. The most important keyboard shortcuts that you can use while being in command mode are shown in Table 2-3.
Table 2-3. Keyboard shortcuts (command mode)

    Shortcut          Action
    Shift+Enter       Run the cell (works also in edit mode)
    ↑ (up-arrow)      Move cell selector up
    ↓ (down-arrow)    Move cell selector down
    b                 Insert a new cell below the current cell
    a                 Insert a new cell above the current cell
    dd                Delete the current cell (type the letter d two times)
    m                 Change cell type to Markdown
    y                 Change cell type to code

Knowing these keyboard shortcuts will allow you to work with notebooks efficiently without having to switch between keyboard and mouse all the time. In the next section, I'll show you a common gotcha that you need to be aware of when using Jupyter notebooks: the importance of running cells in order.

Run Order Matters

As easy and user-friendly as notebooks are to get started with, they also make it easy to get into confusing states if you don't run cells sequentially. Assume you have the following notebook cells that are run from top to bottom:

    In [2]: a = 1

    In [3]: a
    Out[3]: 1

    In [4]: a = 2

Cell Out[3] prints the value 1 as expected. However, if you now go back and run In[3] again, you will end up in this situation:

    In [2]: a = 1

    In [5]: a
    Out[5]: 2

    In [4]: a = 2

Out[5] now shows the value 2, which is probably not what you would expect when you read the notebook from the top, especially if cell In[4] were farther away, requiring you to scroll down. To prevent such cases, I would recommend that you rerun not just a single cell, but all of its previous cells, too. Jupyter notebooks offer you an easy way to accomplish this under the menu Cell > Run All Above. After these words of caution, let's see how you shut down a notebook properly!

Shutting Down Jupyter Notebooks

Every notebook runs in a separate Jupyter kernel. A kernel is the "engine" that runs the Python code you type into a notebook cell. Every kernel uses resources from your operating system in the form of CPU and RAM.
Therefore, when you close a notebook, you should also shut down its kernel so that the resources can be used again by other tasks—this will prevent your system from slowing down. The easiest way to accomplish this is by closing a notebook via File > Close and Halt. If you just close the browser tab, the kernel will not be shut down automatically. Alternatively, on the Jupyter dashboard, you can close running notebooks from the tab Running. To shut down the whole Jupyter server, click the Quit button at the top right of the Jupyter dashboard. If you have already closed your browser, you can type Ctrl+C twice in the Anaconda Prompt where the notebook server is running or close the Anaconda Prompt altogether.

Jupyter Notebooks in the Cloud

Jupyter notebooks have become so popular that they are offered as a hosted solution by various cloud providers. I am introducing three services here that are all free to use. The advantage of these services is that they run instantly and everywhere you can access a browser, without the need to install anything locally. You could, for example, run the samples on a tablet while reading the first three parts. Since Part IV requires a local installation of Excel, this won't work there, though.

Binder
    Binder (https://mybinder.org) is a service provided by Project Jupyter, the organization behind Jupyter notebooks. Binder is meant to try out the Jupyter notebooks from public Git repositories—you don't store anything on Binder itself and hence you don't need to sign up or log in to use it.

Kaggle Notebooks
    Kaggle (https://kaggle.com) is a platform for data science. As it hosts data science competitions, you get easy access to a huge collection of datasets. Kaggle has been part of Google since 2017.

Google Colab
    Google Colab (https://oreil.ly/4PLcS), short for Colaboratory, is Google's notebook platform. Unfortunately, the majority of the Jupyter notebook keyboard shortcuts don't work, but you can access files on your Google Drive, including Google Sheets.

The easiest way to run the Jupyter notebooks of the companion repository in the cloud is by going to its Binder URL (https://oreil.ly/MAjJK). You will be working on a copy of the companion repository, so feel free to edit and break stuff as you like!

Now that you know how to work with Jupyter notebooks, let's move on and learn about how to write and run standard Python scripts. To do this, we'll use Visual Studio Code, a powerful text editor with great Python support.

Visual Studio Code

In this section, we'll install and configure Visual Studio Code (VS Code), a free and open source text editor from Microsoft. After introducing its most important components, we'll write a first Python script and run it in a few different ways. To begin with, however, I will explain when we'll use Jupyter notebooks as opposed to running Python scripts and why I chose VS Code for this book.

While Jupyter notebooks are amazing for interactive workflows like researching, teaching, and experimenting, they are less ideal if you want to write Python scripts geared toward a production environment that do not need the visualization capabilities of notebooks. Also, more complex projects that involve many files and developers are hard to manage with Jupyter notebooks. In this case, you want to use a proper text editor to write and run classic Python files. In theory, you could use just about any text editor (even Notepad would work), but in reality, you want one that "understands" Python. That is, a text editor that supports at least the following features:

Syntax highlighting
    The editor colors words differently based on whether they represent a function, a string, a number, etc.
This makes it much easier to read and understand the code.

Autocomplete
    Autocomplete or IntelliSense, as Microsoft calls it, automatically suggests text components so that you have to type less, which leads to fewer errors.

And soon enough, you have other needs that you would like to access directly from within the editor:

Run code
    Switching back and forth between the text editor and an external Anaconda Prompt (i.e., Command Prompt or Terminal) to run your code can be a hassle.

Debugger
    A debugger allows you to step through the code line by line to see what's going on.

Version control
    If you use Git to version control your files, it makes sense to handle the Git-related stuff directly in the editor so you don't have to switch back and forth between two applications.

There is a wide spectrum of tools that can help you with all that, and as usual, every developer has different needs and preferences. Some may indeed want to use a no-frills text editor together with an external Command Prompt. Others may prefer an integrated development environment (IDE): IDEs try to put everything you'll ever need into a single tool, which can make them bloated.

I chose VS Code for this book as it has quickly become one of the most popular code editors among developers after its initial release in 2015: in the StackOverflow Developer Survey 2019, it came out as the most popular development environment. What makes VS Code such a popular tool? In essence, it's the right mix between a bare-bones text editor and a full-blown IDE: VS Code is a mini IDE that comes with everything you need for programming out of the box, but not more:

Cross-platform
    VS Code runs on Windows, macOS, and Linux.
There are also cloud-hosted versions like GitHub Codespaces.

Integrated tools
    VS Code comes with a debugger, support for Git version control, and has an integrated Terminal that you can use as Anaconda Prompt.

Extensions
    Everything else, e.g., Python support, is added via extensions that can be installed with a single click.

Lightweight
    Depending on your operating system, the VS Code installer is just 50–100 MB.

Visual Studio Code vs. Visual Studio

Don't confuse Visual Studio Code with Visual Studio, the IDE! While you could use Visual Studio for Python development (it comes with PTVS, the Python Tools for Visual Studio), it's a really heavy installation and is traditionally used to work with .NET languages like C#.

To find out if you agree with my praise for VS Code, there is no better way than installing it and trying it out yourself. The next section gets you started!

Installation and Configuration

Download the installer from the VS Code home page. For the latest installation instructions, please always refer to the official docs.

Windows
    Double-click the installer and accept all defaults. Then open VS Code via the Windows Start menu, where you will find it under Visual Studio Code.

macOS
    Double-click the ZIP file to unpack the app. Then drag and drop Visual Studio Code.app into the Applications folder: you can now start it from the Launchpad. If the application doesn't start, go to System Preferences > Security & Privacy > General and choose Open Anyway.

When you open VS Code for the first time, it looks like Figure 2-5. Note that I have switched from the default dark theme to a light theme to make the screenshots easier to read.
Figure 2-5. Visual Studio Code

Activity Bar
    On the lefthand side, you see the Activity Bar with the following icons from top to bottom:
    • Explorer
    • Search
    • Source Control
    • Run
    • Extensions

Status Bar
    At the bottom of the editor, you have the Status Bar. Once you have the configuration complete and edit a Python file, you will see the Python interpreter show up there.

Command Palette
    You can show the Command Palette via F1 or with the keyboard shortcut Ctrl+Shift+P (Windows) or Command-Shift-P (macOS). If you are unsure about something, your first stop should always be the Command Palette, as it gives you easy access to almost everything you can do with VS Code. For example, if you are looking for keyboard shortcuts, type in keyboard shortcuts, select the entry "Help: Keyboard Shortcuts Reference," and hit Enter.

VS Code is a great text editor out of the box, but to make it work nicely with Python, there are a few more things to configure: click on the Extensions icon on the Activity Bar and search for Python. Install the official Python extension that shows Microsoft as the author. It will take a moment to install and once done, you may need to click the Reload Required button to finish—alternatively, you could also restart VS Code completely. Finalize the configuration according to your platform:

Windows
    Open the Command Palette and type default shell. Select the entry that reads "Terminal: Select Default Shell" and hit Enter. In the dropdown menu, select Command Prompt and confirm by hitting Enter. This is required because otherwise VS Code can't properly activate Conda environments.

macOS
    Open the Command Palette and type shell command. Select the entry that reads "Shell Command: Install 'code' command in PATH" and hit Enter.
This is required so that you can start VS Code conveniently from the Anaconda Prompt (i.e., the Terminal).

Now that VS Code is installed and configured, let's use it to write and run our first Python script!

Running a Python Script

While you can open VS Code via the Start menu on Windows or Launchpad on macOS, it's often faster to open VS Code from the Anaconda Prompt, where you are able to launch it via the code command. Therefore, open a new Anaconda Prompt and change into the directory where you want to work by using the cd command, then instruct VS Code to open the current directory (represented by the dot):

    (base)> cd C:\Users\username\python-for-excel
    (base)> code .

Starting VS Code this way will cause the Explorer on the Activity Bar to automatically show the contents of the directory you were in when you ran the code command. Alternatively, you could also open a directory via File > Open Folder (on macOS: File > Open), but this might cause permission errors on macOS when we start using xlwings in Part IV. When you hover over the file list in the Explorer on the Activity Bar, you will see the New File button appear as shown in Figure 2-6. Click on New File and call your file hello_world.py, then hit Enter. Once it opens in the editor, write the following line of code:

    print("hello world!")

Remember that Jupyter notebooks conveniently print the return value of the last line automatically? When you run a traditional Python script, you need to tell Python explicitly what to print, which is why you need to use the print function here. In the Status Bar, you should now see your Python version, e.g., "Python 3.8.5 64-bit (conda)." If you click on it, the Command Palette will open and allow you to select a different Python interpreter if you have more than one (this includes Conda environments). Your setup should now look like the one in Figure 2-6.
Figure 2-6. VS Code with hello_world.py open

Before we can run the script, make sure to save it by hitting Ctrl+S on Windows or Command-S on macOS. With Jupyter notebooks, we could simply select a cell and hit Shift+Enter to run that cell. With VS Code, you can run your code from either the Anaconda Prompt or by clicking the Run button. Running Python code from the Anaconda Prompt is how you most likely run scripts that are on a server, so it's important to know how this works.

Anaconda Prompt
    Open an Anaconda Prompt, cd into the directory with the script, then run the script like so:

        (base)> cd C:\Users\username\python-for-excel
        (base)> python hello_world.py
        hello world!

    The last line is the output that is printed by the script. Note that if you are not in the same directory as your Python file, you need to use the full path to your Python file:

        (base)> python C:\Users\username\python-for-excel\hello_world.py
        hello world!

Long File Paths on the Anaconda Prompt

A convenient way to deal with long file paths is to drag and drop the file onto your Anaconda Prompt. This will write the full path wherever the cursor is.

Anaconda Prompt in VS Code
    You don't need to switch away from VS Code to work with the Anaconda Prompt: VS Code has an integrated Terminal that you can show via the keyboard shortcut Ctrl+` or via View > Terminal. Since it opens in the project folder, you don't need to change the directory first:

        (base)> python hello_world.py
        hello world!

Run Button in VS Code
    In VS Code, there is an easy way to run your code without having to use the Anaconda Prompt: when you edit a Python file, you will see a green Play icon at the top right—this is the Run File button, as shown in Figure 2-6.
Clicking it will open the Terminal at the bottom automatically and run the code there.

Opening Files in VS Code

VS Code has an unconventional default behavior when you single-click a file in the Explorer (Activity Bar): the file is opened in preview mode, which means that the next file that you single-click will replace it in the tab unless you have made some changes to the file. If you want to switch off the single-click behavior (so a single-click will select a file and a double-click will open it), go to Preferences > Settings (Ctrl+, on Windows or Command-, on macOS) and set the dropdown under Workbench > "List: Open Mode" to "doubleClick."

At this point, you know how to create, edit, and run Python scripts in VS Code. VS Code can do quite a bit more, though: in Appendix B, I explain how to use the debugger and how you can run Jupyter notebooks with VS Code.

Alternative Text Editors and IDEs

Tools are something individual, and just because this book is based on Jupyter notebooks and VS Code doesn't mean you shouldn't have a look at other options.

Some popular text editors include:

Sublime Text
    Sublime (https://oreil.ly/9FVLD) is a fast commercial text editor.

Notepad++
    Notepad++ (https://oreil.ly/7Ksk9) is free and has been around for a very long time but is Windows-only.

Vim or Emacs
    Vim (https://vim.org) or Emacs may not be the best options for beginner programmers due to their steep learning curve, but they are very popular among professionals. The rivalry between the two free editors is so big that Wikipedia describes it as the "editor war."

Popular IDEs include:

PyCharm
    The PyCharm (https://oreil.ly/OrIj-) community edition is free and very powerful, while the professional edition is commercial and adds support for scientific tools and web development.

Spyder
    Spyder (https://spyder-ide.org) is similar to MATLAB's IDE and comes with a variable explorer. Since it's included in the Anaconda distribution, you can give it a try by running the following on an Anaconda Prompt: (base)> spyder.

JupyterLab
    JupyterLab (https://jupyter.org) is a web-based IDE developed by the team behind Jupyter notebooks and can, of course, run Jupyter notebooks. Other than that, it tries to integrate everything else you need for your data science tasks into a single tool.

Wing Python IDE
    Wing Python IDE (https://wingware.com) is an IDE that has been around for a long time. There are free simplified versions and a commercial version called Wing Pro.

Komodo IDE
    Komodo IDE (https://oreil.ly/Cdtab) is a commercial IDE developed by ActiveState and supports many other languages apart from Python.

PyDev
    PyDev (https://pydev.org) is a Python IDE based on the popular Eclipse IDE.

Conclusion

In this chapter, I showed you how to install and use the tools we will work with: the Anaconda Prompt, Jupyter notebooks, and VS Code. We also ran a tiny bit of Python code in a Python REPL, in a Jupyter notebook, and as a script in VS Code.

I do recommend you get comfortable with the Anaconda Prompt, as it will give you a lot of power once you get used to it. The ability to work with Jupyter notebooks in the cloud is also very comfortable, as it allows you to run the code samples of the first three parts of this book in your browser.

With a working development environment, you are now ready to tackle the next chapter, where you'll learn enough Python to be able to follow the rest of the book.

CHAPTER 3
Getting Started with Python

With Anaconda installed and Jupyter notebooks up and running, you have everything in place to get started with Python. Although this chapter doesn't go much further than the basics, it still covers a lot of ground. If you are at the beginning of your coding career, there may be a lot to digest.
However, most concepts will get clearer once you use them in later chapters as part of a practical example, so there's no need to worry if you don't understand something fully the first time around. Whenever Python and VBA differ significantly, I will point this out to make sure you can transition from VBA to Python smoothly and are aware of the obvious traps. If you haven't done any VBA before, feel free to ignore these parts.

I will start this chapter with Python's basic data types, such as integers and strings. After that, I will introduce indexing and slicing, a core concept in Python that gives you access to specific elements of a sequence. Up next are data structures like lists and dictionaries that can hold multiple objects. I'll continue with the if statement and the for and while loops before getting to an introduction of functions and modules that allow you to organize and structure your code. To wrap this chapter up, I will show you how to format your Python code properly. As you have probably guessed by now, this chapter is as technical as it can get. Running the examples for yourself in a Jupyter notebook is therefore a good idea to make everything a bit more interactive and playful. Either type the examples yourself or run them by using the provided notebooks in the companion repository.

Data Types

Python, like every other programming language, treats numbers, text, booleans, etc. differently by assigning them a different data type. The data types that we will use most often are integers, floats, booleans, and strings. In this section, I am going to introduce them one after another with a few examples. To be able to understand data types, though, I first need to explain what an object is.

Objects

In Python, everything is an object, including numbers, strings, functions, and everything else that we'll meet in this chapter. Objects can make complex things easy and intuitive by giving you access to a set of variables and functions.
So before anything else, let me say a few words about variables and functions!

Variables

In Python, a variable is a name that you assign to an object by using the equal sign. In the first line of the following example, the name a is assigned to the object 3:

    In [1]: a = 3
            b = 4
            a + b
    Out[1]: 7

This works the same for all objects, which is simpler compared to VBA, where you use the equal sign for data types like numbers and strings and the Set statement for objects like workbooks or worksheets. In Python, you change a variable's type simply by assigning it to a new object. This is referred to as dynamic typing:

    In [2]: a = 3
            print(a)
            a = "three"
            print(a)
    3
    three

Unlike VBA, Python is case-sensitive, so a and A are two different variables. Variable names must follow certain rules:

• They must start with either a letter or an underscore
• They must consist of letters, numbers, and underscores

After this short introduction to variables, let's see how we can make function calls!

Functions

I will introduce functions with a lot more detail later in this chapter. For now, you should simply know how to call built-in functions like print that we used in the previous code sample. To call a function, you add parentheses to the function name and provide the arguments within the parentheses, which is pretty much equivalent to the mathematical notation:

    function_name(argument1, argument2, ...)

Let's now look at how variables and functions work in the context of objects!

Attributes and methods

In the context of objects, variables are called attributes and functions are called methods: attributes give you access to the data of an object, and methods allow you to perform an action. To access attributes and methods, you use the dot notation like this: myobject.attribute and myobject.method().

Let's make this a bit more tangible: if you write a car racing game, you would most likely use an object that represents a car.
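Sketched in code, such a car object could look like this (an illustration of my own; the book defers writing classes to Appendix C):

```python
class Car:
    """A minimal Car class for illustration only."""

    def __init__(self):
        self.speed = 0  # attribute: the car's current speed in mph

    def accelerate(self, mph):
        self.speed += mph  # method: perform an action on the object


car = Car()          # instantiation: calling the class returns a car object
car.accelerate(10)   # dot notation to call the method
print(car.speed)     # dot notation to access the attribute: prints 10
```

The prose that follows walks through exactly these pieces: the speed attribute, the accelerate method, and the car = Car() instantiation.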
The car object could have a speed attribute that allows you to get the current speed via car.speed, and you might be able to accelerate the car by calling the accelerate method car.accelerate(10), which would increase the speed by ten miles per hour.

The type of an object and with that its behavior is defined by a class, so the previous example would require you to write a Car class. The process of getting a car object out of a Car class is called instantiation, and you instantiate an object by calling the class in the same way as you call a function: car = Car(). We won't write our own classes in this book, but if you are interested in how this works, have a look at Appendix C.

We will use a first object method in the next section to make a text string uppercase, and we will get back to the topic of objects and classes when we talk about datetime objects toward the end of this chapter. Now, however, let's move on with those objects that have a numeric data type!

Numeric Types

The data types int and float represent integers and floating-point numbers, respectively. To find out the data type of a given object, use the built-in type:

    In [3]: type(4)
    Out[3]: int

    In [4]: type(4.4)
    Out[4]: float

If you want to force a number to be a float instead of an int, it's good enough to use a trailing decimal point or the float constructor:

    In [5]: type(4.)
    Out[5]: float

    In [6]: float(4)
    Out[6]: 4.0

The last example can also be turned around: using the int constructor, you can turn a float into an int. If the fractional part is not zero, it will be truncated:

    In [7]: int(4.9)
    Out[7]: 4

Excel Cells Always Store Floats

You may need to convert a float to an int when you read in a number from an Excel cell and provide it as an argument to a Python function that expects an integer.
The reason is that numbers in Excel cells are always stored as floats behind the scenes, even if Excel shows you what looks like an integer.

Python has a few more numeric types that I won't use or discuss in this book: there are the decimal, fraction, and complex data types. If floating-point inaccuracies are an issue (see sidebar), use the decimal type for exact results. These cases are very rare, though. As a rule of thumb: if Excel would be good enough for the calculations, use floats.

Floating-point Inaccuracies

By default, Excel often shows rounded numbers: type =1.125-1.1 into a cell, and you will see 0.025. While this might be what you expect, it is not what Excel stores internally. Change the display format to show at least 16 decimals, and it will change to 0.0249999999999999. This is the effect of floating-point inaccuracy: computers live in a binary world, i.e., they calculate only with zeros and ones. Certain decimal fractions like 0.1 can't be stored as a finite binary floating-point number, which explains the result from the subtraction. In Python, you will see the same effect, but Python doesn't hide the decimals from you:

    In [8]: 1.125 - 1.1
    Out[8]: 0.02499999999999991

Mathematical operators

Calculating with numbers requires the use of mathematical operators like the plus or minus sign. Except for the power operator, there shouldn't be any surprise if you come from Excel:

    In [9]: 3 + 4  # Sum
    Out[9]: 7

    In [10]: 3 - 4  # Subtraction
    Out[10]: -1

    In [11]: 3 / 4  # Division
    Out[11]: 0.75

    In [12]: 3 * 4  # Multiplication
    Out[12]: 12

    In [13]: 3**4  # The power operator (Excel uses 3^4)
    Out[13]: 81

    In [14]: 3 * (3 + 4)  # Use of parentheses
    Out[14]: 21

Comments

In the previous examples, I was describing the operation of the example by using comments (e.g., # Sum). Comments help other people (and yourself a few weeks after writing the code) to understand what's going on in your program.
It is good practice to only comment those things that are not already evident from reading the code: when in doubt, it's better to have no comment than an outdated comment that contradicts the code. Anything starting with a hash sign is a comment in Python and is ignored when you run the code:

    In [15]: # This is a sample we've seen before.
             # Every comment line has to start with a #
             3 + 4
    Out[15]: 7

    In [16]: 3 + 4  # This is an inline comment
    Out[16]: 7

Most editors have a keyboard shortcut to comment/uncomment lines. In Jupyter notebooks and VS Code, it is Ctrl+/ (Windows) or Command-/ (macOS). Note that Markdown cells in Jupyter notebooks won't accept comments—if you start a line with a # there, Markdown will interpret this as a heading.

Having integers and floats covered, let's move straight to the next section about booleans!

Booleans

The boolean types in Python are True or False, exactly like in VBA. The boolean operators and, or, and not, however, are all lowercase, while VBA shows them capitalized. Boolean expressions are similar to how they work in Excel, except for equality and inequality operators:

    In [17]: 3 == 4  # Equality (Excel uses 3 = 4)
    Out[17]: False

    In [18]: 3 != 4  # Inequality (Excel uses 3 <> 4)
    Out[18]: True

    In [19]: 3 < 4  # Smaller than. Use > for bigger than.
    Out[19]: True

    In [20]: 3 <= 4  # Smaller or equal. Use >= for bigger or equal.
    Out[20]: True

    In [21]: # You can chain logical expressions
             # In VBA, this would be: 10 < 12 And 12 < 17
             10 < 12 < 17
    Out[21]: True

In a Jupyter notebook, typing an object name followed by a dot and hitting the Tab key will make a dropdown appear with all the attributes and methods that this object offers. If your cursor is in a method, for example within the parentheses of "python".upper(), hit Shift+Tab to get the description of that function. VS Code will display this information automatically as a tooltip.
If you run aPython REPL on the Anaconda Prompt, use dir("python") to get the availableattributes and help("python".upper) to print the description of the upper method.Other than that, it’s always a good idea to get back to Python’s online documentation.If you are looking for the documentation of third-party packages like pandas, it’shelpful to search for them on PyPI, Python’s package index, where you will find thelinks to the respective home pages and documentation.When working with strings, a regular task is to select parts of a string: for example,you may want to get the USD part out of the EURUSD exchange rate notation. The nextsection shows you Python’s powerful indexing and slicing mechanism that allows youto do exactly this.Indexing and SlicingIndexing and slicing give you access to specific elements of a sequence. Since stringsare sequences of characters, we can use them to learn how it works. In the next sec‐tion, we will meet additional sequences like lists and tuples that support indexing andslicing too.50 | Chapter 3: Getting Started with Pythonhttps://docs.python.orghttps://pypi.orgIndexingFigure 3-1 introduces the concept of indexing. Python is zero-based, which meansthat the first element in a sequence is referred to by index 0. Negative indices from -1allow you to refer to elements from the end of the sequence.Figure 3-1. Indexing from the beginning and end of a sequenceCommon Error Traps for VBA DevelopersIf you are coming from VBA, indexing is a common error trap.VBA uses one-based indexing for most collections like sheets(Sheets(1)) but uses zero-based indexing for arrays (MyArray(0)),although that default can be changed. 
Another difference is thatVBA uses parentheses for indexing while Python uses squarebrackets.The syntax forindexing is as follows:sequence[index]Accordingly, you access specific elements from a string like this:In [36]: language = "PYTHON"In [37]: language[0]Out[37]: 'P'In [38]: language[1]Out[38]: 'Y'In [39]: language[-1]Out[39]: 'N'In [40]: language[-2]Out[40]: 'O'You will often want to extract more than just a single character—this is where slicingcomes in.Indexing and Slicing | 51SlicingIf you want to get more than one element from a sequence, you use the slicing syntax,which works as follows:sequence[start:stop:step]Python uses half-open intervals: the start index is included while the stop indexis not. If you leave the start or stop arguments away, it will include everythingfrom the beginning or to the end of the sequence, respectively. step determines thedirection and the step size: for example, 2 will return every second element from leftto right and -3 will return every third element from right to left. The default step sizeis one:In [41]: language[:3] # Same as language[0:3]Out[41]: 'PYT'In [42]: language[1:3]Out[42]: 'YT'In [43]: language[-3:] # Same as language[-3:6]Out[43]: 'HON'In [44]: language[-3:-1]Out[44]: 'HO'In [45]: language[::2] # Every second elementOut[45]: 'PTO'In [46]: language[-1:-4:-1] # Negative step goes from right to leftOut[46]: 'NOH'So far we’ve looked at just a single index or slice operation, but Python also allowsyou to chain multiple index and slice operations together. 
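The slicing rules above are already enough to solve the task mentioned earlier: getting the USD part out of the EURUSD exchange rate notation. A short sketch of mine:

```python
pair = "EURUSD"
base = pair[:3]    # first three characters
quote = pair[3:]   # everything from index 3 to the end
print(base, quote)  # EUR USD
print(pair[-3:])    # USD -- the same thing, counted from the end
```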
For example, if you want toget the second character out of the last three characters, you could do it like this:In [47]: language[-3:][1]Out[47]: 'O'This is the same as language[-2] so in this case, it wouldn’t make much sense to usechaining, but it will make more sense when we use indexing and slicing with lists, oneof the data structures that I am going to introduce in the next section.Data StructuresPython offers powerful data structures that make working with a collection of objectsreally easy. In this section, I am going to introduce lists, dictionaries, tuples, and sets.While each of these data structures has slightly different characteristics, they are all52 | Chapter 3: Getting Started with Pythonable to hold multiple objects. In VBA, you may have used collections or arrays to holdmultiple values. VBA even offers a data structure called dictionary that works con‐ceptually the same as Python’s dictionary. It is, however, only available on the Win‐dows version of Excel out of the box. Let’s get started with lists, the data structure thatyou will probably use most.ListsLists are capable of holding multiple objects of different data types. They are so versa‐tile that you will use them all the time. You create a list as follows:[element1, element2, ...]Here are two lists, one with the names of Excel files and the other one with a fewnumbers:In [48]: file_names = ["one.xlsx", "two.xlsx", "three.xlsx"] numbers = [1, 2, 3]Like strings, lists can easily be concatenated with the plus sign. 
This also shows youthat lists can hold different types of objects:In [49]: file_names + numbersOut[49]: ['one.xlsx', 'two.xlsx', 'three.xlsx', 1, 2, 3]As lists are objects like everything else, lists can also have other lists as their elements.I will refer to them as nested lists:In [50]: nested_list = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]If you rearrange this to span over multiple lines, you can easily recognize that this is avery nice representation of a matrix, or a range of spreadsheet cells. Note that thesquare brackets implicitly allow you to break the lines (see sidebar). Via indexing andslicing, you get the elements you want:In [51]: cells = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]In [52]: cells[1] # Second rowOut[52]: [4, 5, 6]In [53]: cells[1][1:] # Second row, second and third columnOut[53]: [5, 6]Data Structures | 53Line ContinuationSometimes, a line of code can get so long that you will need to break it up into two ormore lines to keep your code readable. Technically, you can either use parentheses ora backslash to break up the line:In [54]: a = (1 + 2 + 3)In [55]: a = 1 + 2 \ + 3Python’s style guide, however, prefers that you use implicit line breaks if possible:whenever you are using an expression that contains parentheses, square brackets, orcurly braces, use them to introduce a line break without having to introduce an addi‐tional character. I will say more about Python’s style guide toward the end of thischapter.You can change elements in lists:In [56]: users = ["Linda", "Brian"]In [57]: users.append("Jennifer") # Most commonly you add to the end usersOut[57]: ['Linda', 'Brian', 'Jennifer']In [58]: users.insert(0, "Kim") # Insert "Kim" at index 0 usersOut[58]: ['Kim', 'Linda', 'Brian', 'Jennifer']To delete an element, use either pop or del. 
While pop is a method, del is imple‐mented as a statement in Python:In [59]: users.pop() # Removes and returns the last element by defaultOut[59]: 'Jennifer'In [60]: usersOut[60]: ['Kim', 'Linda', 'Brian']In [61]: del users[0] # del removes an element at the given indexSome other useful things you can do with lists are:In [62]: len(users) # LengthOut[62]: 2In [63]: "Linda" in users # Check if users contains "Linda"Out[63]: True54 | Chapter 3: Getting Started with PythonIn [64]: print(sorted(users)) # Returns a new sorted list print(users) # The original list is unchanged['Brian', 'Linda']['Linda', 'Brian']In [65]: users.sort() # Sorts the original list usersOut[65]: ['Brian', 'Linda']Note that you can use len and in with strings as well:In [66]: len("Python")Out[66]: 6In [67]: "free" in "Python is free and open source."Out[67]: TrueTo get access to elements in a list, you refer to them by their position or index—that’snot always practical. Dictionaries, the topic of the next section, allow you to get accessto elements via a key (often a name).DictionariesDictionaries map keys to values. You will come across key/value combinations all thetime. The easiest way to create a dictionary is as follows:{key1: value1, key2: value2, ...}While lists allow you to access elements by index, i.e., position, dictionaries allow youto access elements by key. As with indices, keys are accessed via square brackets. 
The following code samples will use a currency pair (key) that maps to the exchange rate (value):

In [68]: exchange_rates = {"EURUSD": 1.1152,
                           "GBPUSD": 1.2454,
                           "AUDUSD": 0.6161}

In [69]: exchange_rates["EURUSD"]  # Access the EURUSD exchange rate
Out[69]: 1.1152

The following samples show you how to change existing values and add new key/value pairs:

In [70]: exchange_rates["EURUSD"] = 1.2  # Change an existing value
         exchange_rates
Out[70]: {'EURUSD': 1.2, 'GBPUSD': 1.2454, 'AUDUSD': 0.6161}

In [71]: exchange_rates["CADUSD"] = 0.714  # Add a new key/value pair
         exchange_rates
Out[71]: {'EURUSD': 1.2, 'GBPUSD': 1.2454, 'AUDUSD': 0.6161, 'CADUSD': 0.714}

The easiest way to merge two or more dictionaries is by unpacking them into a new one. You unpack a dictionary by using two leading asterisks. If the second dictionary contains keys from the first one, the values from the first will be overridden. You can see this happening by looking at the GBPUSD exchange rate:

In [72]: {**exchange_rates, **{"SGDUSD": 0.7004, "GBPUSD": 1.2222}}
Out[72]: {'EURUSD': 1.2, 'GBPUSD': 1.2222, 'AUDUSD': 0.6161, 'CADUSD': 0.714, 'SGDUSD': 0.7004}

Python 3.9 introduced the pipe character as a dedicated merge operator for dictionaries, which allows you to simplify the previous expression to this:

exchange_rates | {"SGDUSD": 0.7004, "GBPUSD": 1.2222}

Many objects can serve as keys; the following is an example with integers:

In [73]: currencies = {1: "EUR", 2: "USD", 3: "AUD"}

In [74]: currencies[1]
Out[74]: 'EUR'

By using the get method, dictionaries allow you to use a default value in case the key doesn't exist:

In [75]: # currencies[100] would raise an exception. Instead of 100,
         # you could use any other non-existing key, too.
         currencies.get(100, "N/A")
Out[75]: 'N/A'

Dictionaries can often be used when you would use a Case statement in VBA.
Theprevious example could be written like this in VBA:Select Case xCase 1 Debug.Print "EUR"Case 2 Debug.Print "USD"Case 3 Debug.Print "AUD"Case Else Debug.Print "N/A"End SelectNow that you know how to work with dictionaries, let’s move on to the next datastructure: tuples. They are similar to lists with one big difference, as we will see in thenext section.56 | Chapter 3: Getting Started with PythonTuplesTuples are similar to lists with the difference that they are immutable: once created,their elements can’t be changed. While you can often use tuples and lists interchange‐ably, tuples are the obvious choice for a collection that never changes throughout theprogram. Tuples are created by separating values with commas:mytuple = element1, element2, ...Using parentheses often makes it easier to read:In [76]: currencies = ("EUR", "GBP", "AUD")Tuples allow you to access elements the same way as lists, but they won’t allow you tochange elements. Instead, concatenating tuples will create a new tuple behind thescenes, then bind your variable to this new tuple:In [77]: currencies[0] # Accessing the first elementOut[77]: 'EUR'In [78]: # Concatenating tuples will return a new tuple. currencies + ("SGD",)Out[78]: ('EUR', 'GBP', 'AUD', 'SGD')I explain the difference between mutable vs. immutable objects in detail in Appen‐dix C, but for now, let’s have a look at the last data structure of this section: sets.SetsSets are collections that have no duplicate elements. While you can use them for settheory operations, in practice they often help you to get the unique values of a list ora tuple. 
You create sets by using curly braces:{element1, element2, ...}To get the unique objects in a list or a tuple, use the set constructor like so:In [79]: set(["USD", "USD", "SGD", "EUR", "USD", "EUR"])Out[79]: {'EUR', 'SGD', 'USD'}Other than that, you can apply set theory operations like intersection and union:In [80]: portfolio1 = {"USD", "EUR", "SGD", "CHF"} portfolio2 = {"EUR", "SGD", "CAD"}In [81]: # Same as portfolio2.union(portfolio1) portfolio1.union(portfolio2)Out[81]: {'CAD', 'CHF', 'EUR', 'SGD', 'USD'}In [82]: # Same as portfolio2.intersection(portfolio1) portfolio1.intersection(portfolio2)Data Structures | 57Out[82]: {'EUR', 'SGD'}For a full overview of set operations, see the official docs. Before moving on, let’squickly revise the four data structures we just met in Table 3-1. It shows a sample foreach data structure in the notation I used in the previous paragraphs, the so-calledliterals. Additionally, I am also listing their constructors that offer an alternative tousing the literals and are often used to convert from one data structure to another.For example, to convert a tuple to a list, do:In [83]: currencies = "USD", "EUR", "CHF" currenciesOut[83]: ('USD', 'EUR', 'CHF')In [84]: list(currencies)Out[84]: ['USD', 'EUR', 'CHF']Table 3-1. Data structuresData Structure Literals ConstructorList [1, 2, 3] list((1, 2, 3))Dictionary {"a": 1, "b": 2} dict(a=1, b=2)Tuple (1, 2, 3) tuple([1, 2, 3])Set {1, 2, 3} set((1, 2, 3))At this point, you know all important data types including basic ones like floats andstrings, and data structures like lists and dictionaries. In the next section, we move onto control flow.Control FlowThis section presents the if statement as well as the for and while loops. The ifstatement allows you to execute certain lines of code only if a condition is met, andthe for and while loops will execute a block of code repeatedly. 
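Building on the constructors from Table 3-1 above, converting one data structure into another is a one-liner. A quick sketch of mine (it uses enumerate, a built-in that is properly introduced in the loops section of this chapter):

```python
prices = [1.1, 2.2, 2.2, 3.3]        # a list with a duplicate
unique = set(prices)                 # set() drops the duplicate 2.2
ordered = tuple(sorted(unique))      # back to an ordered, immutable sequence
numbered = dict(enumerate(ordered))  # index -> value pairs
print(ordered)   # (1.1, 2.2, 3.3)
print(numbered)  # {0: 1.1, 1: 2.2, 2: 3.3}
```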
At the end of the sec‐tion, I will also introduce list comprehensions, which are a way to construct lists thatcan serve as an alternative to for loops. I will start this section with the definition ofcode blocks, for which I also need to introduce one of Python’s most noteworthy par‐ticularities: significant white space.Code Blocks and the pass StatementA code block defines a section in your source code that is used for something special.For example, you use a code block to define the lines over which your program islooping or it makes up the definition of a function. In Python, you define code blocksby indenting them, not by using keywords like in VBA or curly braces like in most58 | Chapter 3: Getting Started with Pythonhttps://oreil.ly/ju4edother languages. This is referred to as significant white space. The Python communityhas settled on four spaces as indentation, but you usually type them in by hitting theTab key: both Jupyter notebooks and VS Code will automatically convert your Tabkey into four spaces. Let me show you how code blocks are formally defined by usingthe if statement:if condition: pass # Do nothingThe line preceding the code block always terminates with a colon. Since the end ofthe code block is reached when you no longer indent the line, you need to use thepass statement if you want to create a dummy code block that does nothing. 
In VBA, this would correspond to the following:If condition Then ' Do nothingEnd IfNow that you know how to define code blocks, let’s start using them in the next sec‐tion, where I will properly introduce the if statement.The if Statement and Conditional ExpressionsTo introduce the if statement, let me reproduce the example from “Readability andMaintainability” on page 13 in Chapter 1, but this time in Python:In [85]: i = 20 if i 0 instead.Conditional expressions, also called ternary operators, allow you to use a more com‐pact style for simple if/else statements:In [88]: is_important = False print("important") if is_important else print("not important")not importantWith if statements and conditional expressions in our pocket, let’s turn our attentionto for and while loops in the next section.The for and while LoopsIf you need to do something repeatedly like printing the value of ten different vari‐ables, you are doing yourself a big favor by not copy/pasting the print statement tentimes. Instead, use a for loop to do the work for you. for loops iterate over the itemsof a sequence likea list, a tuple, or a string (remember, strings are sequences of char‐acters). As an introductory example, let’s create a for loop that takes each element ofthe currencies list, assigns it to the variable currency and prints it—one afteranother until there are no more elements in the list:In [89]: currencies = ["USD", "HKD", "AUD"] for currency in currencies: print(currency)USDHKDAUDAs a side note, VBA’s For Each statement is close to how Python’s for loop works.The previous example could be written like this in VBA:Dim currencies As VariantDim curr As Variant 'currency is a reserved word in VBAcurrencies = Array("USD", "HKD", "AUD")60 | Chapter 3: Getting Started with PythonFor Each curr In currencies Debug.Print currNextIn Python, if you need a counter variable in a for loop, the range or enumerate built-ins can help you with that. 
Let's first look at range, which provides a sequence of numbers: you call it by either providing a single stop argument or by providing a start and stop argument, with an optional step argument. Like with slicing, start is inclusive, stop is exclusive, and step determines the step size, with 1 being the default:

range(stop)
range(start, stop, step)

range evaluates lazily, which means that without explicitly asking for it, you won't see the sequence it generates:

In [90]: range(5)
Out[90]: range(0, 5)

Converting the range to a list solves this issue:

In [91]: list(range(5))  # stop argument
Out[91]: [0, 1, 2, 3, 4]

In [92]: list(range(2, 5, 2))  # start, stop, step arguments
Out[92]: [2, 4]

Most of the time, there's no need to wrap range with a list, though:

In [93]: for i in range(3):
             print(i)
0
1
2

If you need a counter variable while looping over a sequence, use enumerate. It returns a sequence of (index, element) tuples. By default, the index starts at zero and increments by one. You can use enumerate in a loop like this:

In [94]: for i, currency in enumerate(currencies):
             print(i, currency)
0 USD
1 HKD
2 AUD

Looping over tuples and sets works the same as with lists.
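Two related tricks that the listings above don't show: enumerate accepts an optional start argument, and the zip built-in (not covered in this chapter) loops over two sequences in parallel. A short sketch of mine:

```python
currencies = ["USD", "HKD", "AUD"]
amounts = [100, 250, 300]

# Start counting at 1 instead of 0:
for i, currency in enumerate(currencies, start=1):
    print(i, currency)

# zip pairs the elements up positionally:
for currency, amount in zip(currencies, amounts):
    print(currency, amount)
```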
When you loop over dictionaries, Python will loop over the keys:

In [95]: exchange_rates = {"EURUSD": 1.1152,
                           "GBPUSD": 1.2454,
                           "AUDUSD": 0.6161}
         for currency_pair in exchange_rates:
             print(currency_pair)
EURUSD
GBPUSD
AUDUSD

By using the items method, you get the key and the value at the same time as a tuple:

In [96]: for currency_pair, exchange_rate in exchange_rates.items():
             print(currency_pair, exchange_rate)
EURUSD 1.1152
GBPUSD 1.2454
AUDUSD 0.6161

To exit a loop, use the break statement:

In [97]: for i in range(15):
             if i == 2:
                 break
             else:
                 print(i)
0
1

You skip the remainder of a loop with the continue statement, which means that execution continues with a new loop and the next element:

In [98]: for i in range(4):
             if i == 2:
                 continue
             else:
                 print(i)
0
1
3

When comparing for loops in VBA with Python, there is a subtle difference: in VBA, the counter variable increases beyond your upper limit after finishing the loop:

For i = 1 To 3
    Debug.Print i
Next i
Debug.Print i

This prints:

1
2
3
4

In Python, it behaves like you would probably expect it to:

In [99]: for i in range(1, 4):
             print(i)
         print(i)
1
2
3
3

Instead of looping over a sequence, you can also use while loops to run a loop while a certain condition is true:

In [100]: n = 0
          while n <= 2:
              print(n)
              n = n + 1
0
1
2

A function hands its result back via the return statement; if you leave it away, the function automatically returns None. Python conveniently allows you to return multiple values separated by commas. To be able to play around with a function, let's define one that is able to convert the temperature from Fahrenheit or Kelvin to degrees Celsius:

In [107]: def convert_to_celsius(degrees, source="fahrenheit"):
              if source.lower() == "fahrenheit":
                  return (degrees-32) * (5/9)
              elif source.lower() == "kelvin":
                  return degrees - 273.15
              else:
                  return f"Don't know how to convert from {source}"

I am using the string method lower, which transforms the provided strings to lowercase. This allows us to accept the source string with any capitalization while the comparison will still work.
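The definition above, reflowed as a self-contained, runnable block (same logic as the book's In [107] listing), together with a few sample calls:

```python
def convert_to_celsius(degrees, source="fahrenheit"):
    """Convert degrees Fahrenheit or Kelvin to degrees Celsius."""
    if source.lower() == "fahrenheit":
        return (degrees - 32) * (5 / 9)
    elif source.lower() == "kelvin":
        return degrees - 273.15
    else:
        return f"Don't know how to convert from {source}"

print(convert_to_celsius(32))           # 0.0
print(convert_to_celsius(0, "KELVIN"))  # -273.15 (capitalization doesn't matter)
print(convert_to_celsius(20, "reaumur"))  # Don't know how to convert from reaumur
```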
With the convert_to_celsius function defined, let’s see howwe can call it!Code Organization | 65Calling functionsAs briefly mentioned at the beginning of this chapter, you call a function by addingparentheses to the function name, enclosing the function arguments:value1, value2, ... = function_name(positional_arg, arg_name=value, ...)Positional argumentsIf you provide a value as a positional argument (positional_arg), the values arematched to the arguments according to their position in the function definition.Keyword argumentsBy providing the argument in the form arg_name=value, you’re providing a key‐word argument. This has the advantage that you can provide the arguments inany order. It is also more explicit to the reader and can make it easier to under‐stand. For example, if the function is defined as f(a, b), you could call the func‐tion like this: f(b=1, a=2). This concept also exists in VBA, where you could usekeyword arguments by calling a function like this: f(b:=1, a:=1).Let’s play around with the convert_to_celsius function to see how this all works inpractice:In [108]: convert_to_celsius(100, "fahrenheit") # Positional argumentsOut[108]: 37.77777777777778In [109]: convert_to_celsius(50) # Will use the default source (fahrenheit)Out[109]: 10.0In [110]: convert_to_celsius(source="kelvin", degrees=0) # Keyword argumentsOut[110]: -273.15Now that you know how to define and call functions, let’s see how to organize themwith the help of modules.Modules and the import StatementWhen you write code for bigger projects, you will have to split it into different files atsome point to be able to bring it into a maintainable structure. As we have alreadyseen in the previous chapter, Python files have the extension .py and you usually referto your main file as a script. If you now want your main script to access functionalityfrom other files, you need to import that functionality first. In this context, Pythonsource files are called modules. 
To get a better feeling for how this works and what thedifferent import options are, have a look at the file temperature.py in the companionrepository by opening it with VS Code (Example 3-1). If you need a refresher on howto open files in VS Code, have another look at Chapter 2.66 | Chapter 3: Getting Started with PythonExample 3-1. temperature.pyTEMPERATURE_SCALES = ("fahrenheit", "kelvin", "celsius")def convert_to_celsius(degrees, source="fahrenheit"): if source.lower() == "fahrenheit": return (degrees-32) * (5/9) elif source.lower() == "kelvin": return degrees - 273.15 else: return f"Don't know how to convert from {source}"print("This is the temperature module.")To be able to import the temperature module from your Jupyter notebook, you willneed the Jupyter notebook and the temperature module to be in the same directory—as it is in the case of the companion repository. To import, you only use the nameof the module, without the .py ending. After running the import statement, you willhave access to all the objects in that Python module via the dot notation. For example,use temperature.convert_to_celsius() to perform your conversion:In [111]: import temperatureThis is the temperature module.In [112]: temperature.TEMPERATURE_SCALESOut[112]: ('fahrenheit', 'kelvin', 'celsius')In [113]: temperature.convert_to_celsius(120, "fahrenheit")Out[113]: 48.88888888888889Note that I used uppercase letters for TEMPERATURE_SCALES to express that it is a con‐stant—I will say more about that toward the end of this chapter. When you executethe cell with import temperature, Python will run the temperature.py file from top tobottom. You can easily see this happening since importing the module will fire theprint function at the bottom of temperature.py.Modules Are Only Imported OnceIf you run the import temperature cell again, you will notice thatit does not print anything anymore. This is because Python mod‐ules are only imported once per session. 
If you change code in amodule that you import, you need to restart your Python inter‐preter to pick up all the changes, i.e., in a Jupyter notebook, you’dhave to click on Kernel > Restart.Code Organization | 67In reality, you usually don’t print anything in modules. This was only to show you theeffect of importing a module more than once. Most commonly, you put functionsand classes in your modules (for more on classes, see Appendix C). If you don’t wantto type temperature every time you use an object from the temperature module,change the import statement like this:In [114]: import temperature as tpIn [115]: tp.TEMPERATURE_SCALESOut[115]: ('fahrenheit', 'kelvin', 'celsius')Assigning a short alias tp to your module can make it easier to use while it’s stillalways clear where an object comes from. Many third-party packages suggest a spe‐cific convention when using an alias. For example, pandas is using import pandas aspd. There is one more option to import objects from another module:In [116]: from temperature import TEMPERATURE_SCALES, convert_to_celsiusIn [117]: TEMPERATURE_SCALESOut[117]: ('fahrenheit', 'kelvin', 'celsius')The __pycache__ FolderWhen you import the temperature module, you will see thatPython creates a folder called __pycache__ with files that havethe .pyc extension. These are bytecode-compiled files that thePython interpreter creates when you import a module. For ourpurposes, we can simply ignore this folder, as it is a technical detailof how Python runs your code.When using the from x import y syntax, you import specific objects only. By doingthis, you are importing them directly into the namespace of your main script: that is,without looking at the import statements, you won’t be able to tell whether theimported objects were defined in your current Python script or Jupyter notebook or ifthey come from another module. 
This could cause conflicts: if your main script has afunction called convert_to_celsius, it would override the one that you are import‐ing from the temperature module. If, however, you use one of the two previousmethods, your local function and the one from the imported module could live nextto each other as convert_to_celsius and temperature.convert_to_celsius.Don’t Name Your Scripts Like Existing PackagesA common source for errors is to name your Python file the sameas an existing Python package or module. If you create a file to testout some pandas functionality, don’t call that file pandas.py, as thiscan cause conflicts.68 | Chapter 3: Getting Started with PythonNow that you know how the import mechanism works, let’s use it right away toimport the datetime module! This will also allow you to learn a few more thingsabout objects and classes.The datetime ClassWorking with date and time is a common operation in Excel, but it comes with limi‐tations: for example, Excel’s cell format for time doesn’t support smaller units thanmilliseconds and time zones are not supported at all. In Excel, date and time arestored as a simple float called the date serial number. The Excel cell is then formattedto displayit as date and/or time. For example, January 1, 1900 has the date serialnumber of 1, which means that this is also the earliest date that you can work with inExcel. Time gets translated into the decimal part of the float, e.g., 01/01/190010:10:00 is represented by 1.4236111111.In Python, to work with date and time, you import the datetime module, which ispart of the standard library. The datetime module contains a class with the samename that allows us to create datetime objects. Since having the same name for themodule and the class can be confusing, I will use the following import conventionthroughout this book: import datetime as dt. 
This makes it easy to differentiatebetween the module (dt) and the class (datetime).Up to this point, we were most of the time using literals to create objects like lists ordictionaries. Literals refer to the syntax that Python recognizes as a specific objecttype—in the case of a list, this would be something like [1, 2, 3]. However, most ofthe objects have to be created by calling their class: this process is called instantiation,and objects are therefore also called class instances. Calling a class works the same wayas calling a function, i.e., you add parentheses to the class name and provide the argu‐ments in the same way we did with functions. To instantiate a datetime object, youneed to call the class like this:import datetime as dtdt.datetime(year, month, day, hour, minute, second, microsecond, timezone)Let’s go through a couple of examples to see how you work with datetime objects inPython. For the purpose of this introduction, let’s ignore time zones and work withtime-zone-naive datetime objects:In [118]: # Import the datetime module as "dt" import datetime as dtIn [119]: # Instantiate a datetime object called "timestamp" timestamp = dt.datetime(2020, 1, 31, 14, 30) timestampOut[119]: datetime.datetime(2020, 1, 31, 14, 30)Code Organization | 69In [120]: # Datetime objects offer various attributes, e.g., to get the day timestamp.dayOut[120]: 31In [121]: # The difference of two datetime objects returns a timedelta object timestamp - dt.datetime(2020, 1, 14, 12, 0)Out[121]: datetime.timedelta(days=17, seconds=9000)In [122]: # Accordingly, you can also work with timedelta objects timestamp + dt.timedelta(days=1, hours=4, minutes=11)Out[122]: datetime.datetime(2020, 2, 1, 18, 41)To format datetime objects into strings, use the strftime method, and to parse astring and convert it into a datetime object, use the strptime function (you can findan overview of the accepted format codes in the datetime docs):In [123]: # Format a datetime object in a specific way # 
You could also use an f-string: f"{timestamp:%d/%m/%Y %H:%M}" timestamp.strftime("%d/%m/%Y %H:%M")Out[123]: '31/01/2020 14:30'In [124]: # Parse a string into a datetime object dt.datetime.strptime("12.1.2020", "%d.%m.%Y")Out[124]: datetime.datetime(2020, 1, 12, 0, 0)After this short introduction to the datetime module, let’s move on to the last topicof this chapter, which is about formatting your code properly.PEP 8: Style Guide for Python CodeYou may have been wondering why I was sometimes using variable names withunderscores or in all caps. This section will explain my formatting choices by intro‐ducing you to Python’s official style guide. Python uses so-called Python Enhance‐ment Proposals (PEP) to discuss the introduction of new language features. One ofthese, the Style Guide for Python Code, is usually referred to by its number: PEP 8.PEP 8 is a set of style recommendations for the Python community; if everybody whoworks on the same code adheres to the same style guide, the code becomes muchmore readable. This is especially important in the world of open source where manyprogrammers work on the same project, often without knowing each other person‐ally. Example 3-2 shows a short Python file that introduces the most importantconventions.Example 3-2. pep8_sample.py"""This script shows a few PEP 8 rules. """70 | Chapter 3: Getting Started with Pythonhttps://oreil.ly/gXOtsimport datetime as dt TEMPERATURE_SCALES = ("fahrenheit", "kelvin", "celsius") class TemperatureConverter: pass # Doesn't do anything at the moment def convert_to_celsius(degrees, source="fahrenheit"): """This function converts degrees Fahrenheit or Kelvin into degrees Celsius. 
""" if source.lower() == "fahrenheit": return (degrees-32) * (5/9) elif source.lower() == "kelvin": return degrees - 273.15 else: return f"Don't know how to convert from {source}"celsius = convert_to_celsius(44, source="fahrenheit") non_celsius_scales = TEMPERATURE_SCALES[:-1] print("Current time: " + dt.datetime.now().isoformat())print(f"The temperature in Celsius is: {celsius}")Explain what the script/module does with a docstring at the top. A docstring is aspecial type of string, enclosed with triple quotes. Apart from serving as a stringfor documenting your code, a docstring also makes it easy to write strings overmultiple lines and is useful if your text contains a lot of double-quotes or single-quotes, as you won’t need to escape them. They are also useful to write multilineSQL queries, as we will see in Chapter 11.All imports are at the top of the file, one per line. List the imports of the standardlibrary first, then those of third-party packages, and finally those from your ownmodules. This sample only makes use of the standard library.Use capital letters with underscores for constants. Use a maximum line length of79 characters. If possible, take advantage of parentheses, square brackets, or curlybraces for implicit line breaks.Separate classes and functions with two empty lines from the rest of the code.PEP 8: Style Guide for Python Code | 71Despite the fact that many classes like datetime are all lowercase, your ownclasses should use CapitalizedWords as names. For more on classes, see Appen‐dix C.Inline comments should be separated by at least two spaces from the code. Codeblocks should be indented by four spaces.Functions and function arguments should use lowercase names with underscoresif they improve readability. Don’t use spaces between the argument name and itsdefault value.A function’s docstring should also list and explain the function arguments. 
Ihaven’t done this here to keep the sample short, but you will find complete doc‐strings in the excel.py file that is included in the companion repository and thatwe will meet in Chapter 8.Don’t use spaces around the colon.Use spaces around mathematical operators. If operators with different prioritiesare used, you may consider adding spaces around those with the lowest priorityonly. Since the multiplication in this example has the lowest priority, I haveadded spaces around it.Use lowercase names for variables. Make use of underscores if they improvereadability. When assigning a variable name, use spaces around the equal sign.However, when calling a function, don’t use spaces around the equal sign usedwith keyword arguments.With indexing and slicing, don’t use spaces around the square brackets.This is a simplified summary of PEP 8, so it’s a good idea to have a look at the originalPEP 8 once you start to get more serious with Python. PEP 8 clearly states that it is arecommendation and that your own style guides will take precedence. After all, con‐sistency is the most important factor. If you are interested in other publicly availableguidelines, you may want to have a look at Google’s style guide for Python, which isreasonably close to PEP 8. In practice, most Python programmers loosely adhere toPEP 8, and ignoring the maximum line length of 79 characters is probably the mostcommon sin.Since it might be difficult to format your code properly while writing it, you can haveyourstyle checked automatically. The next section shows you how this works withVS Code.72 | Chapter 3: Getting Started with Pythonhttps://oreil.ly/3fTTZhttps://oreil.ly/6sYSaPEP 8 and VS CodeWhen working with VS Code, there is an easy way to make sure your code sticks toPEP 8: use a linter. A linter checks your source code for syntax and style errors. Fireup the command palette (Ctrl+Shift+P on Windows or Command-Shift-P onmacOS) and search for Python: Select Linter. 
A popular option is flake8, a package that comes preinstalled with Anaconda. If enabled, VS Code will underline issues with squiggly lines every time you save your file. Hovering over such a squiggly line will give you an explanation in a tooltip. You switch a linter off again by searching for "Python: Enable Linting" in the command palette and choosing "Disable Linting." If you prefer, you can also run flake8 on an Anaconda Prompt to have a report printed (the command only prints something if there is a violation of PEP 8, so running this on pep8_sample.py won't print anything unless you introduce a violation):

(base)> cd C:\Users\username\python-for-excel
(base)> flake8 pep8_sample.py

Python has recently taken static code analysis a step further by adding support for type hints. The next section explains how they work.

Type Hints

In VBA, you often see code that prefixes each variable with an abbreviation for the data type, like strEmployeeName or wbWorkbookName. While nobody will stop you from doing this in Python, it isn't commonly done. You also won't find an equivalent to VBA's Option Explicit or Dim statement to declare the type of a variable. Instead, Python 3.5 introduced a feature called type hints. Type hints are also referred to as type annotations and allow you to declare the data type of a variable. They are completely optional and have no effect on how the code is run by the Python interpreter (there are, however, third-party packages like pydantic that can enforce type hints at runtime). The main purpose of type hints is to allow text editors like VS Code to catch more errors before running the code, but they can also improve code autocompletion in VS Code and other editors. The most popular type checker for type-annotated code is mypy, which VS Code offers as a linter.
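To make this concrete, here is a small sketch (the function name is my own example, not from the book) of the kind of mistake a type checker such as mypy flags before the code ever runs; the Python interpreter itself only complains at runtime, if at all:

```python
# A sketch of a type-annotated function; the function name is a made-up example.
# Running mypy on a file containing the commented-out call would flag it.

def fahrenheit_to_celsius(degrees: float) -> float:
    """Convert degrees Fahrenheit into degrees Celsius."""
    return (degrees - 32) * (5 / 9)

print(fahrenheit_to_celsius(212.0))  # works: float in, float out
# fahrenheit_to_celsius("212")  # mypy: incompatible type "str"; expected "float"
```

Note that the annotations don't change runtime behavior: Python would accept the string argument without complaint and only fail inside the function body when the subtraction is attempted.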
To get a feeling for how type annotations work in Python, here is a short sample without type hints:

x = 1


def hello(name):
    return f"Hello {name}!"

And again with type hints:

x: int = 1


def hello(name: str) -> str:
    return f"Hello {name}!"

As type hints generally make more sense in bigger codebases, I am not going to use them in the remainder of this book.

Conclusion

This chapter was a packed introduction to Python. We met the most important building blocks of the language, including data structures, functions, and modules. We also touched on some of Python's particularities, like meaningful white space and code formatting guidelines, better known as PEP 8. To continue with this book, you won't need to know all the details: as a beginner, just knowing about lists and dictionaries, indexing and slicing, as well as how to work with functions, modules, for loops, and if statements will get you far already.

Compared to VBA, I find Python more consistent and powerful but at the same time easier to learn. If you are a VBA die-hard fan and this chapter didn't convince you just yet, I am pretty sure the next part will: there, I will give you an introduction to array-based calculations before starting our data analysis journey with the pandas library. Let's get started with Part II by learning a few basics about NumPy!

PART II
Introduction to pandas

CHAPTER 4
NumPy Foundations

As you may recall from Chapter 1, NumPy is the core package for scientific computing in Python, providing support for array-based calculations and linear algebra. As NumPy is the backbone of pandas, I am going to introduce its basics in this chapter: after explaining what a NumPy array is, we will look into vectorization and broadcasting, two important concepts that allow you to write concise mathematical code and that you will find again in pandas.
After that, we’re going to see why NumPyoffers special functions called universal functions before we wrap this chapter up bylearning how to get and set values of an array and by explaining the differencebetween a view and a copy of a NumPy array. Even if we will hardly use NumPydirectly in this book, knowing its basics will make it easier to learn pandas in the nextchapter.Getting Started with NumPyIn this section, we’ll learn about one- and two-dimensional NumPy arrays and what’sbehind the technical terms vectorization, broadcasting, and universal function.NumPy ArrayTo perform array-based calculations with nested lists, as we met them in the lastchapter, you would have to write some sort of loop. For example, to add a number toevery element in a nested list, you can use the following nested list comprehension:In [1]: matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]In [2]: [[i + 1 for i in row] for row in matrix]Out[2]: [[2, 3, 4], [5, 6, 7], [8, 9, 10]]77This isn’t very readable and more importantly, if you do this with big arrays, loopingthrough each element becomes very slow. Depending on your use case and the size ofthe arrays, calculating with NumPy arrays instead of Python lists can make your cal‐culations from a couple of times to around a hundred times faster. NumPy achievesthis performance by making use of code that was written in C or Fortran—these arecompiled programming languages that are much faster than Python. A NumPy arrayis an N-dimensional array for homogenous data. Homogenous means that all ele‐ments in the array need to be of the same data type. Most commonly, you are dealingwith one- and two-dimensional arrays of floats as schematically displayed inFigure 4-1.Figure 4-1. 
A one-dimensional and two-dimensional NumPy array

Let's create a one- and two-dimensional array to work with throughout this chapter:

In [3]: # First, let's import NumPy
        import numpy as np

In [4]: # Constructing an array with a simple list results in a 1d array
        array1 = np.array([10, 100, 1000.])

In [5]: # Constructing an array with a nested list results in a 2d array
        array2 = np.array([[1., 2., 3.],
                           [4., 5., 6.]])

Array Dimension

It's important to note the difference between a one- and two-dimensional array: a one-dimensional array has only one axis and hence does not have an explicit column or row orientation. While this behaves like arrays in VBA, you may have to get used to it if you come from a language like MATLAB, where one-dimensional arrays always have a column or row orientation.

Even if array1 consists of integers except for the last element (which is a float), the homogeneity of NumPy arrays forces the data type of the array to be float64, which is capable of accommodating all elements. To learn about an array's data type, access its dtype attribute:

In [6]: array1.dtype

Out[6]: dtype('float64')

Since dtype gives you back float64 instead of float, which we met in the last chapter, you may have guessed that NumPy uses its own numerical data types, which are more granular than Python's data types. This usually isn't an issue, though, as most of the time, conversion between the different data types in Python and NumPy happens automatically. If you ever need to explicitly convert a NumPy data type to one of Python's basic data types, simply use the corresponding constructor (I will say more about accessing an element from an array shortly):

In [7]: float(array1[0])

Out[7]: 10.0

For a full list of NumPy's data types, see the NumPy docs.
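As a related sketch (my own example, not from the book): instead of relying on type inference, you can also request a data type explicitly when constructing an array, and convert later with the astype method:

```python
import numpy as np

# Request the data type explicitly instead of letting NumPy infer it
int_array = np.array([10, 100, 1000], dtype="int64")
print(int_array.dtype)  # int64

# astype returns a new array with the converted data type;
# the original array is left unchanged
float_array = int_array.astype("float64")
print(float_array.dtype)  # float64
```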
With NumPy arrays, you can write simple code to perform array-based calculations, as we will see next.

Vectorization and Broadcasting

If you build the sum of a scalar and a NumPy array, NumPy will perform an element-wise operation, which means that you don't have to loop through the elements yourself. The NumPy community refers to this as vectorization. It allows you to write concise code, practically representing the mathematical notation:

In [8]: array2 + 1

Out[8]: array([[2., 3., 4.],
               [5., 6., 7.]])

Scalar

Scalar refers to a basic Python data type like a float or a string. This is to differentiate them from data structures with multiple elements like lists and dictionaries or one- and two-dimensional NumPy arrays.

The same principle applies when you work with two arrays: NumPy performs the operation element-wise:

In [9]: array2 * array2

Out[9]: array([[ 1.,  4.,  9.],
               [16., 25., 36.]])

If you use two arrays with different shapes in an arithmetic operation, NumPy extends—if possible—the smaller array automatically across the larger array so that their shapes become compatible. This is called broadcasting:

In [10]: array2 * array1

Out[10]: array([[  10.,  200., 3000.],
                [  40.,  500., 6000.]])

To perform matrix multiplications or dot products, use the @ operator (if it's been a while since your last linear algebra class, you can skip this example—matrix multiplication is not something this book builds upon):

In [11]: array2 @ array2.T  # array2.T is a shortcut for array2.transpose()

Out[11]: array([[14., 32.],
                [32., 77.]])

Don't be intimidated by the terminology I've introduced in this section, such as scalar, vectorization, or broadcasting! If you have ever worked with arrays in Excel, this should all feel very natural, as shown in Figure 4-2. The screenshot is taken from array_calculations.xlsx, which you will find in the xl directory of the companion repository.

Figure 4-2.
Array-based calculations in Excel

You know now that arrays perform arithmetic operations element-wise, but how can you apply a function on every element in an array? This is what universal functions are here for.

Universal Functions (ufunc)

Universal functions (ufunc) work on every element in a NumPy array. For example, if you use Python's standard square root function from the math module on a NumPy array, you will get an error:

In [12]: import math

In [13]: math.sqrt(array2)  # This will raise an error
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
----> 1 math.sqrt(array2)  # This will raise an error
TypeError: only size-1 arrays can be converted to Python scalars

You could, of course, write a nested loop to get the square root of every element, then build a NumPy array again from the result:

In [14]: np.array([[math.sqrt(i) for i in row] for row in array2])

Out[14]: array([[1.        , 1.41421356, 1.73205081],
                [2.        , 2.23606798, 2.44948974]])

This will work in cases where NumPy doesn't offer a ufunc and the array is small enough. However, if NumPy has a ufunc, use it, as it will be much faster with big arrays—apart from being easier to type and read:

In [15]: np.sqrt(array2)

Out[15]: array([[1.        , 1.41421356, 1.73205081],
                [2.        , 2.23606798, 2.44948974]])

Some of NumPy's ufuncs, like sum, are additionally available as array methods: if you want the sum of each column, do the following:

In [16]: array2.sum(axis=0)  # Returns a 1d array

Out[16]: array([5., 7., 9.])

The argument axis=0 refers to the axis along the rows while axis=1 refers to the axis along the columns, as depicted in Figure 4-1. Leaving the axis argument away sums up the whole array:

In [17]: array2.sum()

Out[17]: 21.0

You will meet more NumPy ufuncs throughout this book, as they can be used with pandas DataFrames.

So far, we've always worked with the entire array.
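One more aside before moving on (my own example, using NumPy's documented np.vectorize helper): if you need to apply a plain Python function for which no ufunc exists, np.vectorize can wrap it so that it accepts whole arrays. Note that this is mainly a convenience—under the hood it still loops in Python, so it doesn't give you ufunc-like speed:

```python
import numpy as np

def label(x):
    # An arbitrary example function with no direct ufunc equivalent
    return "big" if x > 3 else "small"

# Wrap the scalar function so it can be applied element-wise
vectorized_label = np.vectorize(label)
array2 = np.array([[1., 2., 3.], [4., 5., 6.]])
print(vectorized_label(array2))
```

For a numeric condition like this one, np.where(array2 > 3, "big", "small") would be the idiomatic (and faster) choice; np.vectorize is for functions that genuinely can't be expressed with array operations.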
The next section shows you how to manipulate parts of an array and introduces a few helpful array constructors.

Creating and Manipulating Arrays

I'll start this section by getting and setting specific elements of an array before introducing a few useful array constructors, including one to create pseudorandom numbers that you could use for a Monte Carlo simulation. I'll wrap this section up by explaining the difference between a view and a copy of an array.

Getting and Setting Array Elements

In the last chapter, I showed you how to index and slice lists to get access to specific elements. When you work with nested lists like matrix from the first example in this chapter, you can use chained indexing: matrix[0][0] will get you the first element of the first row. With NumPy arrays, however, you provide the index and slice arguments for both dimensions in a single pair of square brackets:

numpy_array[row_selection, column_selection]

For one-dimensional arrays, this simplifies to numpy_array[selection]. When you select a single element, you will get back a scalar; otherwise, you will get back a one- or two-dimensional array. Remember that slice notation uses a start index (included) and an end index (excluded) with a colon in between, as in start:end. By leaving away the start and end index, you are left with a colon, which therefore stands for all rows or all columns in a two-dimensional array. I have visualized a few examples in Figure 4-3, but you may also want to give Figure 4-1 another look, as the indices and axes are labeled there. Remember, by slicing a column or row of a two-dimensional array, you end up with a one-dimensional array, not with a two-dimensional column or row vector!

Figure 4-3.
Selecting elements of a NumPy array

Play around with the examples shown in Figure 4-3 by running the following code:

In [18]: array1[2]  # Returns a scalar

Out[18]: 1000.0

In [19]: array2[0, 0]  # Returns a scalar

Out[19]: 1.0

In [20]: array2[:, 1:]  # Returns a 2d array

Out[20]: array([[2., 3.],
                [5., 6.]])

In [21]: array2[:, 1]  # Returns a 1d array

Out[21]: array([2., 5.])

In [22]: array2[1, :2]  # Returns a 1d array

Out[22]: array([4., 5.])

So far, I have constructed the sample arrays by hand, i.e., by providing numbers in a list. But NumPy also offers a few useful functions to construct arrays.

Useful Array Constructors

NumPy offers a few ways to construct arrays that will also be helpful to create pandas DataFrames, as we will see in Chapter 5. One way to easily create arrays is to use the arange function. This stands for array range and is similar to the built-in range that we met in the previous chapter—with the difference that arange returns a NumPy array. Combining it with reshape allows us to quickly generate an array with the desired dimensions:

In [23]: np.arange(2 * 5).reshape(2, 5)  # 2 rows, 5 columns

Out[23]: array([[0, 1, 2, 3, 4],
                [5, 6, 7, 8, 9]])

Another common need, for example for Monte Carlo simulations, is to generate arrays of normally distributed pseudorandom numbers. NumPy makes this easy:

In [24]: np.random.randn(2, 3)  # 2 rows, 3 columns

Out[24]: array([[-0.30047275, -1.19614685, -0.13652283],
                [ 1.05769357,  0.03347978, -1.2153504 ]])

Other helpful constructors worth exploring are np.ones and np.zeros to create arrays with ones and zeros, respectively, and np.eye to create an identity matrix. We'll come across some of these constructors again in the next chapter, but for now, let's learn about the difference between a view and a copy of a NumPy array.

View vs. Copy

NumPy arrays return views when you slice them. This means that you are working with a subset of the original array without copying the data.
Setting a value on a view will therefore also change the original array:

In [25]: array2

Out[25]: array([[1., 2., 3.],
                [4., 5., 6.]])

In [26]: subset = array2[:, :2]
         subset

Out[26]: array([[1., 2.],
                [4., 5.]])

In [27]: subset[0, 0] = 1000

In [28]: subset

Out[28]: array([[1000.,    2.],
                [   4.,    5.]])

In [29]: array2

Out[29]: array([[1000.,    2.,    3.],
                [   4.,    5.,    6.]])

If that's not what you want, you would have to change In [26] as follows:

subset = array2[:, :2].copy()

Working on a copy will leave the original array unchanged.

Conclusion

In this chapter, I showed you how to work with NumPy arrays and what's behind expressions such as vectorization and broadcasting. Putting these technical terms aside, working with arrays should feel quite intuitive given that they follow the mathematical notation very closely. While NumPy is an incredibly powerful library, there are two main issues when you want to use it for data analysis:

• The whole NumPy array needs to be of the same data type. This, for example, means that you can't perform any of the arithmetic operations we did in this chapter when your array contains a mix of text and numbers. As soon as text is involved, the array will have the data type object, which will not allow mathematical operations.

• Using NumPy arrays for data analysis makes it hard to know what each column or row refers to because you typically select columns via their position, such as in array2[:, 1].

pandas has solved these issues by providing smarter data structures on top of NumPy arrays. What they are and how they work is the topic of the next chapter.

CHAPTER 5
Data Analysis with pandas

This chapter will introduce you to pandas, the Python Data Analysis Library or—how I like to put it—the Python-based spreadsheet with superpowers. It's so powerful that some of the companies that I worked with have managed to get rid of Excel completely by replacing it with a combination of Jupyter notebooks and pandas.
As a reader of this book, however, I assume you will keep Excel, in which case pandas will serve as an interface for getting data in and out of spreadsheets. pandas makes tasks that are particularly painful in Excel easier, faster, and less error-prone. Some of these tasks include getting big datasets from external sources and working with statistics, time series, and interactive charts. pandas' most important superpowers are vectorization and data alignment. As we've already seen in the previous chapter with NumPy arrays, vectorization allows you to write concise, array-based code, while data alignment makes sure that there is no data mismatch when you work with multiple datasets.

This chapter covers the whole data analysis journey: it starts with cleaning and preparing data before it shows you how to make sense out of bigger datasets via aggregation, descriptive statistics, and visualization. At the end of the chapter, we'll see how we can import and export data with pandas. But first things first—let's get started with an introduction to pandas' main data structures: DataFrame and Series!

DataFrame and Series

DataFrame and Series are the core data structures in pandas. In this section, I am introducing them with a focus on the main components of a DataFrame: index, columns, and data. A DataFrame is similar to a two-dimensional NumPy array, but it comes with column and row labels and each column can hold different data types. By extracting a single column or row from a DataFrame, you get a one-dimensional Series. Again, a Series is similar to a one-dimensional NumPy array with labels. When you look at the structure of a DataFrame in Figure 5-1, it won't take a lot of imagination to see that DataFrames are going to be your Python-based spreadsheets.

Figure 5-1. A pandas Series and DataFrame

To show you how easy it is to transition from a spreadsheet to a DataFrame, consider the following Excel table in Figure 5-2, which shows participants of an online course with their score.
You will find the corresponding file course_participants.xlsx in the xl folder of the companion repo.

Figure 5-2. course_participants.xlsx

To make this Excel table available in Python, start by importing pandas, then use its read_excel function, which returns a DataFrame:

In [1]: import pandas as pd

In [2]: pd.read_excel("xl/course_participants.xlsx")

Out[2]:    user_id   name  age  country  score continent
        0     1001   Mark   55    Italy    4.5    Europe
        1     1000   John   33      USA    6.7   America
        2     1002    Tim   41      USA    3.9   America
        3     1003  Jenny   12  Germany    9.0    Europe

The read_excel Function with Python 3.9

If you are running pd.read_excel with Python 3.9 or above, make sure to use at least pandas 1.2 or you will get an error when reading xlsx files.

(As a footnote: pandas 1.0.0 introduced a dedicated string data type to make some operations easier and more consistent with text. As it is still experimental, I am not going to make use of it in this book.)

If you run this in a Jupyter notebook, the DataFrame will be nicely formatted as an HTML table, which makes it even closer to how the table looks in Excel. I will spend the whole of Chapter 7 on reading and writing Excel files with pandas, so this was only an introductory example to show you that spreadsheets and DataFrames are, indeed, very similar.
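Reading from Excel isn't the only way to get a DataFrame. As a quick sketch (same data as the Excel table, but built directly in code), you can also pass a dictionary that maps column names to lists of values:

```python
import pandas as pd

# Build the course-participants table from a dict of columns;
# without an explicit index, pandas numbers the rows from zero
df = pd.DataFrame({
    "user_id": [1001, 1000, 1002, 1003],
    "name": ["Mark", "John", "Tim", "Jenny"],
    "age": [55, 33, 41, 12],
    "country": ["Italy", "USA", "USA", "Germany"],
    "score": [4.5, 6.7, 3.9, 9.0],
    "continent": ["Europe", "America", "America", "Europe"],
})
print(df)
```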
Let’s now re-create this DataFrame from scratch without readingit from the Excel file: one way of creating a DataFrame is to provide the data as a nes‐ted list, along with values for columns and index:In [3]: data=[["Mark", 55, "Italy", 4.5, "Europe"], ["John", 33, "USA", 6.7, "America"], ["Tim", 41, "USA", 3.9, "America"], ["Jenny", 12, "Germany", 9.0, "Europe"]] df = pd.DataFrame(data=data, columns=["name", "age", "country", "score", "continent"], index=[1001, 1000, 1002, 1003]) dfOut[3]: name age country score continent 1001 Mark 55 Italy 4.5 Europe 1000 John 33 USA 6.7 America 1002 Tim 41 USA 3.9 America 1003 Jenny 12 Germany 9.0 EuropeBy calling the info method, you will get some basic information, most importantlythe number of data points and the data types for each column:In [4]: df.info()Int64Index: 4 entries, 1001 to 1003Data columns (total 5 columns): # Column Non-Null Count Dtype--- ------ -------------- ----- 0 name 4 non-null object 1 age 4 non-null int64 2 country 4 non-null object 3 score 4 non-null float64 4 continent 4 non-null objectdtypes: float64(1), int64(1), object(3)memory usage: 192.0+ bytesIf you are just interested in the data type of your columns, run df.dtypes instead.Columns with strings or mixed data types will have the data type object. 1 Let us nowhave a closer look at the index and columns of a DataFrame.DataFrame and Series | 87IndexThe row labels of a DataFrame are called index. If you don’t have a meaningful index,leave it away when constructing a DataFrame. pandas will then automatically createan integer index starting at zero. We saw this in the very first example when we readthe DataFrame from the Excel file. An index will allow pandas to look up data fasterand is essential for many common operations, e.g., combining two DataFrames. Youaccess the index object like the following:In [5]: df.indexOut[5]: Int64Index([1001, 1000, 1002, 1003], dtype='int64')If it makes sense, give the index a name. 
Let’s follow the table in Excel, and give it thename user_id:In [6]: df.index.name = "user_id" dfOut[6]: name age country score continent user_id 1001Mark 55 Italy 4.5 Europe 1000 John 33 USA 6.7 America 1002 Tim 41 USA 3.9 America 1003 Jenny 12 Germany 9.0 EuropeUnlike the primary key of a database, a DataFrame index can have duplicates, butlooking up values may be slower in that case. To turn an index into a regular column use reset_index, and to set a new index use set_index. If you don’t want to loseyour existing index when setting a new one, make sure to reset it first:In [7]: # "reset_index" turns the index into a column, replacing the # index with the default index. This corresponds to the DataFrame # from the beginning that we loaded from Excel. df.reset_index()Out[7]: user_id name age country score continent 0 1001 Mark 55 Italy 4.5 Europe 1 1000 John 33 USA 6.7 America 2 1002 Tim 41 USA 3.9 America 3 1003 Jenny 12 Germany 9.0 EuropeIn [8]: # "reset_index" turns "user_id" into a regular column and # "set_index" turns the column "name" into the index df.reset_index().set_index("name")Out[8]: user_id age country score continent name Mark 1001 55 Italy 4.5 Europe John 1000 33 USA 6.7 America Tim 1002 41 USA 3.9 America Jenny 1003 12 Germany 9.0 Europe88 | Chapter 5: Data Analysis with pandasBy doing df.reset_index().set_index("name"), you are using method chaining:since reset_index() returns a DataFrame, you can directly call another DataFramemethod without having to write out the intermediate result first.DataFrame Methods Return CopiesWhenever you call a method on a DataFrame in the formdf.method_name(), you will get back a copy of the DataFrame withthat method applied, leaving the original DataFrame untouched.We have just done that by calling df.reset_index(). 
to change the original DataFrame, you would have to assign the return value back to the original variable, like the following:

df = df.reset_index()

Since we are not doing this, it means that our variable df is still holding its original data. The next samples also call DataFrame methods, i.e., they don't change the original DataFrame.

To change the index, use the reindex method:

In [9]: df.reindex([999, 1000, 1001, 1004])

Out[9]:          name   age country  score continent
        user_id
        999       NaN   NaN     NaN    NaN       NaN
        1000     John  33.0     USA    6.7   America
        1001     Mark  55.0   Italy    4.5    Europe
        1004      NaN   NaN     NaN    NaN       NaN

This is a first example of data alignment at work: reindex will take over all rows that match the new index and will introduce rows with missing values (NaN) where no information exists. Index elements that you leave away will be dropped. I will introduce NaN properly a bit later in this chapter. Finally, to sort an index, use the sort_index method:

In [10]: df.sort_index()

Out[10]:          name  age  country  score continent
         user_id
         1000     John   33      USA    6.7   America
         1001     Mark   55    Italy    4.5    Europe
         1002      Tim   41      USA    3.9   America
         1003    Jenny   12  Germany    9.0    Europe

If, instead, you want to sort the rows by one or more columns, use sort_values:

In [11]: df.sort_values(["continent", "age"])

Out[11]:          name  age  country  score continent
         user_id
         1000     John   33      USA    6.7   America
         1002      Tim   41      USA    3.9   America
         1003    Jenny   12  Germany    9.0    Europe
         1001     Mark   55    Italy    4.5    Europe

The sample shows how to sort first by continent, then by age. If you wanted to sort by only one column, you could also provide the column name as a string:

df.sort_values("continent")

This has covered the basics of how indices work.
Let’s now turn our attention to itshorizontal equivalent, the DataFrame columns!ColumnsTo get information about the columns of a DataFrame, run the following code:In [12]: df.columnsOut[12]: Index(['name', 'age', 'country', 'score', 'continent'], dtype='object')If you don’t provide any column names when constructing a DataFrame, pandas willnumber the columns with integers starting at zero. With columns, however, thisis almost never a good idea as columns represent variables and are therefore easyto name. You assign a name to the column headers in the same way we did it withthe index:In [13]: df.columns.name = "properties" dfOut[13]: properties name age country score continent user_id 1001 Mark 55 Italy 4.5 Europe 1000 John 33 USA 6.7 America 1002 Tim 41 USA 3.9 America 1003 Jenny 12 Germany 9.0 EuropeIf you don’t like the column names, rename them:In [14]: df.rename(columns={"name": "First Name", "age": "Age"})Out[14]: properties First Name Age country score continent user_id 1001 Mark 55 Italy 4.5 Europe 1000 John 33 USA 6.7 America 1002 Tim 41 USA 3.9 America 1003 Jenny 12 Germany 9.0 EuropeIf you want to delete columns, use the following syntax (the sample shows you how todrop columns and indices at the same time):In [15]: df.drop(columns=["name", "country"], index=[1000, 1003])90 | Chapter 5: Data Analysis with pandasOut[15]: properties age score continent user_id 1001 55 4.5 Europe 1002 41 3.9 AmericaThe columns and the index of a DataFrame are both represented by an Index object,so you can change your columns into rows and vice versa by transposing yourDataFrame:In [16]: df.T # Shortcut for df.transpose()Out[16]: user_id 1001 1000 1002 1003 properties name Mark John Tim Jenny age 55 33 41 12 country Italy USA USA Germany score 4.5 6.7 3.9 9 continent Europe America America EuropeIt’s worth remembering here that our DataFrame df is still unchanged, as we havenever reassigned the returning DataFrame from the method calls back to the originaldf variable. 
If you would like to reorder the columns of a DataFrame, you could use the reindex method that we used with the index, but selecting the columns in the desired order is often more intuitive:

In [17]: df.loc[:, ["continent", "country", "name", "age", "score"]]

Out[17]: properties continent  country   name  age  score
         user_id
         1001          Europe    Italy   Mark   55    4.5
         1000         America      USA   John   33    6.7
         1002         America      USA    Tim   41    3.9
         1003          Europe  Germany  Jenny   12    9.0

This last example needs quite a few explanations: everything about loc and how data selection works is the topic of the next section.

Data Manipulation

Real-world data hardly gets served on a silver platter, so before working with it, you need to clean it and bring it into a digestible form. We'll begin this section by looking at how to select data from a DataFrame, how to change it, and how to deal with missing and duplicate data. We'll then perform a few calculations with DataFrames and see how you work with text data. To wrap this section up, we'll find out when pandas returns a view vs. a copy of the data. Quite a few concepts in this section are related to what we have already seen with NumPy arrays in the last chapter.

Selecting Data

Let's start with accessing data by label and position before looking at other methods, including boolean indexing and selecting data by using a MultiIndex.

Selecting by label

The most common way of accessing the data of a DataFrame is by referring to its labels. Use the attribute loc, which stands for location, to specify which rows and columns you want to retrieve:

df.loc[row_selection, column_selection]

loc supports the slice notation and therefore accepts a colon to select all rows or columns, respectively. Additionally, you can provide lists with labels as well as a single column or row name. Have a look at Table 5-1 to see a few examples of how you select different parts from our sample DataFrame df.

Table 5-1.
Data selection by labelSelection Return Data Type ExampleSingle value Scalar df.loc[1000, "country"]One column (1d) Series df.loc[:, "country"]One column (2d) DataFrame df.loc[:, ["country"]]Multiple columns DataFrame df.loc[:, ["country", "age"]]Range of columns DataFrame df.loc[:, "name":"country"]One row (1d) Series df.loc[1000, :]One row (2d) DataFrame df.loc[[1000], :]Multiple rows DataFrame df.loc[[1003, 1000], :]Range of rows DataFrame df.loc[1000:1002, :]Label Slicing Has Closed IntervalsUsing slice notation with labels is inconsistent with respect to howeverything else in Python and pandas works: they include the upperend.Applying our knowledge from Table 5-1, let’s use loc to select scalars, Series, andDataFrames:In [18]: # Using scalars for both row and column selection returns a scalar df.loc[1001, "name"]Out[18]: 'Mark'In [19]: # Using a scalar on either the row or column selection returns a Series df.loc[[1001, 1002], "age"]92 | Chapter 5: Data Analysis with pandasOut[19]: user_id 1001 55 1002 41 Name: age, dtype: int64In [20]: # Selecting multiple rows and columns returns a DataFrame df.loc[:1002, ["name", "country"]]Out[20]: properties name country user_id 1001 Mark Italy 1000 John USA 1002 Tim USAIt’s important for you to understand the difference between a DataFrame with one ormore columns and a Series: even with a single column, DataFrames are two-dimensional, while Series are one-dimensional. Both DataFrame and Series have anindex, but only the DataFrame has column headers. When you select a column asSeries, the column header becomes the name of the Series. Many functions or meth‐ods will work on both Series and DataFrame, but when you perform arithmetic calcu‐lations, the behavior differs: with DataFrames, pandas aligns the data according to thecolumn headers—more about that a little later in this chapter.Shortcut for Column SelectionSince selecting columns is such a common operation, pandas offersa shortcut. 
Instead of:df.loc[:, column_selection]you can write:df[column_selection]For example, df["country"] returns a Series from our sampleDataFrame and df[["name", "country"]] returns a DataFramewith two columns.Selecting by positionSelecting a subset of a DataFrame by position corresponds to what we did at thebeginning of this chapter with NumPy arrays. With DataFrames, however, you haveto use the iloc attribute, which stands for integer location:df.iloc[row_selection, column_selection]When using slices, you deal with the standard half-open intervals. Table 5-2 gives youthe same cases we looked at previously in Table 5-1.Data Manipulation | 93Table 5-2. Data selection by positionSelection Return Data Type ExampleSingle value Scalar df.iloc[1, 2]One column (1d) Series df.iloc[:, 2]One column (2d) DataFrame df.iloc[:, [2]]Multiple columns DataFrame df.iloc[:, [2, 1]]Range of columns DataFrame df.iloc[:, :3]One row (1d) Series df.iloc[1, :]One row (2d) DataFrame df.iloc[[1], :]Multiple rows DataFrame df.iloc[[3, 1], :]Range of rows DataFrame df.iloc[1:3, :]Here is how you use iloc—again with the same samples that we used with locbefore:In [21]: df.iloc[0, 0] # Returns a ScalarOut[21]: 'Mark'In [22]: df.iloc[[0, 2], 1] # Returns a SeriesOut[22]: user_id 1001 55 1002 41 Name: age, dtype: int64In [23]: df.iloc[:3, [0, 2]] # Returns a DataFrameOut[23]: properties name country user_id 1001 Mark Italy 1000 John USA 1002 Tim USASelecting data by label or position is not the only means to access a subset of yourDataFrame. Another important way is to use boolean indexing; let’s see how it works!Selecting by boolean indexingBoolean indexing refers to selecting subsets of a DataFrame with the help of a Seriesor a DataFrame whose data consists of only True or False. Boolean Series are used toselect specific columns and rows of a DataFrame, while boolean DataFrames are usedto select specific values across a whole DataFrame. 
Most commonly, you will useboolean indexing to filter the rows of a DataFrame. Think of it as the AutoFilter func‐tionality in Excel. For example, this is how you filter your DataFrame so it only showspeople who live in the USA and are older than 40 years:94 | Chapter 5: Data Analysis with pandasIn [24]: tf = (df["age"] > 40) & (df["country"] == "USA") tf # This is a Series with only True/FalseOut[24]: user_id 1001 False 1000 False 1002 True 1003 False dtype: boolIn [25]: df.loc[tf, :]Out[25]: properties name age country score continent user_id 1002 Tim 41 USA 3.9 AmericaThere are two things I need to explain here. First, due to technical limitations, youcan’t use Python’s boolean operators from Chapter 3 with DataFrames. Instead, youneed to use the symbols as shown in Table 5-3.Table 5-3. Boolean operatorsBasic Python Data Types DataFrames and Seriesand &or |not ~Second, if you have more than one condition, make sure to put every boolean expres‐sion in between parentheses so operator precedence doesn’t get in your way: forexample, & has higher operator precedence than ==. Therefore, without parentheses,the expression from the sample would be interpreted as:df["age"] > (40 & df["country"]) == "USA"If you want to filter the index, you can refer to it as df.index:In [26]: df.loc[df.index > 1001, :]Out[26]: properties name age country score continent user_id 1002 Tim 41 USA 3.9 America 1003 Jenny 12 Germany 9.0 EuropeFor what you would use the in operator with basic Python data structures like lists,use isin with a Series. 
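As a quick aside, isin pairs naturally with the ~ operator from the table above to express a "not in" filter. A minimal sketch with a hypothetical stand-in Series:

```python
import pandas as pd

# A stand-in for the sample DataFrame's country column
countries = pd.Series(["Italy", "USA", "USA", "Germany"])

# isin returns a boolean Series: True where the value is in the list
in_europe = countries.isin(["Italy", "Germany"])
assert in_europe.tolist() == [True, False, False, True]

# ~ inverts the boolean Series, giving you a "not in" filter
assert (~in_europe).tolist() == [False, True, True, False]
```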
This is how you filter your DataFrame to participants from Italy and Germany:

In [27]: df.loc[df["country"].isin(["Italy", "Germany"]), :]
Out[27]: properties   name  age  country  score continent
         user_id
         1001         Mark   55    Italy    4.5    Europe
         1003        Jenny   12  Germany    9.0    Europe

While you use loc to provide a boolean Series, DataFrames offer a special syntax without loc to select values given the full DataFrame of booleans:

df[boolean_df]

This is especially helpful if you have DataFrames that consist of only numbers. Providing a DataFrame of booleans returns the DataFrame with NaN wherever the boolean DataFrame is False. Again, a more detailed discussion of NaN will follow shortly. Let's start by creating a new sample DataFrame called rainfall that consists of only numbers:

In [28]: # This could be the yearly rainfall in millimeters
         rainfall = pd.DataFrame(data={"City 1": [300.1, 100.2],
                                       "City 2": [400.3, 300.4],
                                       "City 3": [1000.5, 1100.6]})
         rainfall
Out[28]:    City 1  City 2  City 3
         0   300.1   400.3  1000.5
         1   100.2   300.4  1100.6

In [29]: rainfall < 400
Out[29]:    City 1  City 2  City 3
         0    True   False   False
         1    True    True   False

In [30]: rainfall[rainfall < 400]
Out[30]:    City 1  City 2  City 3
         0   300.1     NaN     NaN
         1   100.2   300.4     NaN

Setting data works with the same loc and iloc attributes that you use for selection; assigning to them (or to a column) changes the DataFrame in place. For example, deriving a birth year column from the age column:

df["birth year"] = 2021 - df["age"]
df

properties   name  age  country  score continent  birth year
user_id
1001         Mark   55    Italy    4.5    Europe        1966
1000         John   33      USA    6.7   America        1988
1002          Tim   41      USA    3.9   America        1980
1003        Jenny   12  Germany    9.0    Europe        2009

I will show you more about calculating with DataFrames in a moment, but before we get there, do you remember that I have used NaN a few times already?
The next sec‐tion will finally give you more context around the topic of missing data.Missing DataMissing data can be a problem as it has the potential to bias the results of your dataanalysis, thereby making your conclusions less robust. Nevertheless, it’s very commonto have gaps in your datasets that you will have to deal with. In Excel, you usuallyhave to deal with empty cells or #N/A errors, but pandas uses NumPy’s np.nan formissing data, displayed as NaN. NaN is the floating-point standard for Not-a-Number.For timestamps, pd.NaT is used instead, and for text, pandas uses None. Using None ornp.nan, you can introduce missing values:In [44]: df2 = df.copy() # Let's start with a fresh copy df2.loc[1000, "score"] = None df2.loc[1003, :] = None df2Out[44]: properties name age country score continent user_id 1001 Mark 55.0 Italy 4.5 Europe 1000 John 33.0 USA NaN America100 | Chapter 5: Data Analysis with pandas 1002 Tim 41.0 USA 3.9 America 1003 None NaN None NaN NoneTo clean a DataFrame, you often want to remove rows with missing data. This is assimple as:In [45]: df2.dropna()Out[45]: properties name age country score continent user_id 1001 Mark 55.0 Italy 4.5 Europe 1002 Tim 41.0 USA 3.9 AmericaIf, however, you only want to remove rows where all values are missing, use the howparameter:In [46]: df2.dropna(how="all")Out[46]: properties name age country score continent user_id 1001 Mark 55.0 Italy 4.5 Europe 1000 John 33.0 USA NaN America 1002 Tim 41.0 USA 3.9 AmericaTo get a boolean DataFrame or Series depending on whether there is NaN or not, useisna:In [47]: df2.isna()Out[47]: properties name age country score continent user_id 1001 False False False False False 1000 False False False True False 1002 False False False False False 1003 True True True True TrueTo fill missing values, use fillna. 
For example, to replace NaN in the score columnwith its mean (I will introduce descriptive statistics like mean shortly):In [48]: df2.fillna({"score": df2["score"].mean()})Out[48]: properties name age country score continent user_id 1001 Mark 55.0 Italy 4.5 Europe 1000 John 33.0 USA 4.2 America 1002 Tim 41.0 USA 3.9 America 1003 None NaN None 4.2 NoneMissing data isn’t the only condition that requires us to clean our dataset. The same istrue for duplicate data, so let’s see what our options are!Data Manipulation | 101Duplicate DataLike missing data, duplicates negatively impact the reliability of your analysis. To getrid of duplicate rows, use the drop_duplicates method. Optionally, you can providea subset of the columns as argument:In [49]: df.drop_duplicates(["country", "continent"])Out[49]: properties name age country score continent user_id 1001 Mark 55 Italy 4.5 Europe 1000 John 33 USA 6.7 America 1003 Jenny 12 Germany 9.0 EuropeBy default, this will leave the first occurrence. To find out if a certain column containsduplicates or to get its unique values, use the following two commands (use df.indexinstead of df["country"] if you wanted to run this on the index instead):In [50]: df["country"].is_uniqueOut[50]: FalseIn [51]: df["country"].unique()Out[51]: array(['Italy', 'USA', 'Germany'], dtype=object)And finally, to understand which rows are duplicates, use the duplicated method,which returns a boolean Series: by default, it uses the parameter keep="first", whichkeeps the first occurrence and marks only duplicates with True. By setting the param‐eter keep=False, it will return True for all rows, including its first occurrence, mak‐ing it easy to get a DataFrame with all duplicate rows. In the following example,we look at the country column for duplicates, but in reality, you often look at theindex or entire rows. In this case, you’d have to use df.index.duplicated() ordf.duplicated() instead:In [52]: # By default, it marks only duplicates as True, i.e. 
# without the first occurrence df["country"].duplicated()Out[52]: user_id 1001 False 1000 False 1002 True 1003 False Name: country, dtype: boolIn [53]: # To get all rows where "country" is duplicated, use # keep=False df.loc[df["country"].duplicated(keep=False), :]Out[53]: properties name age country score continent user_id102 | Chapter 5: Data Analysis with pandas 1000 John 33 USA 6.7 America 1002 Tim 41 USA 3.9 AmericaOnce you have cleaned your DataFrames by removing missing and duplicate data,you might want to perform some arithmetic operations—the next section gives youan introduction to how this works.Arithmetic OperationsLike NumPy arrays, DataFrames and Series make use of vectorization. For example,to add a number to every value in the rainfall DataFrame, simply do the following:In [54]: rainfallOut[54]: City 1 City 2 City 3 0 300.1 400.3 1000.5 1 100.2 300.4 1100.6In [55]: rainfall + 100Out[55]: City 1 City 2 City 3 0 400.1 500.3 1100.5 1 200.2 400.4 1200.6However, the true power of pandas is its automatic data alignment mechanism: whenyou use arithmetic operators with more than one DataFrame, pandas automaticallyaligns them by their columns and row indices. Let’s create a second DataFrame withsome of the same row and column labels. We then build the sum:In [56]: more_rainfall = pd.DataFrame(data=[[100, 200], [300, 400]], index=[1, 2], columns=["City 1", "City 4"]) more_rainfallOut[56]: City 1 City 4 1 100 200 2 300 400In [57]: rainfall + more_rainfallOut[57]: City 1 City 2 City 3 City 4 0 NaN NaN NaN NaN 1 200.2 NaN NaN NaN 2 NaN NaN NaN NaNThe index and columns of the resulting DataFrame are the union of the indices andcolumns of the two DataFrames: the fields that have a value in both DataFrames showthe sum, while the rest of the DataFrame shows NaN. This may be something you haveto get used to if you come from Excel, where empty cells are automatically turnedinto zeros when you use them in arithmetic operations. 
To get the same behavior asin Excel, use the add method with a fill_value to replace NaN values with zeros:In [58]: rainfall.add(more_rainfall, fill_value=0)Data Manipulation | 103Out[58]: City 1 City 2 City 3 City 4 0 300.1 400.3 1000.5 NaN 1 200.2 300.4 1100.6 200.0 2 300.0 NaN NaN 400.0This works accordingly for the other arithmetic operators as shown in Table 5-4.Table 5-4. Arithmetic operatorsOperator Method* mul+ add- sub/ div** powWhen you have a DataFrame and a Series in your calculation,by default the Series is broadcast along the index:In [59]: # A Series taken from a row rainfall.loc[1, :]Out[59]: City 1 100.2 City 2 300.4 City 3 1100.6 Name: 1, dtype: float64In [60]: rainfall + rainfall.loc[1, :]Out[60]: City 1 City 2 City 3 0 400.3 700.7 2101.1 1 200.4 600.8 2201.2Hence, to add a Series column-wise, you need to use the add method with an explicitaxis argument:In [61]: # A Series taken from a column rainfall.loc[:, "City 2"]Out[61]: 0 400.3 1 300.4 Name: City 2, dtype: float64In [62]: rainfall.add(rainfall.loc[:, "City 2"], axis=0)Out[62]: City 1 City 2 City 3 0 700.4 800.6 1400.8 1 400.6 600.8 1401.0While this section is about DataFrames with numbers and how they behave in arith‐metic operations, the next section shows your options when it comes to manipulatingtext in DataFrames.104 | Chapter 5: Data Analysis with pandasWorking with Text ColumnsAs we have seen at the beginning of this chapter, columns with text or mixed datatypes have the data type object. To perform operations on columns with text strings,use the str attribute that gives you access to Python’s string methods. We havealready met a few string methods in Chapter 3, but it won’t hurt to have a look at theavailable methods in the Python docs. For example, to remove leading and trailingwhite space, use the strip method; to make all first letters capitalized, there is thecapitalize method. 
Chaining these together will clean up messy text columns that are often the result of manual data entry:

In [63]: # Let's create a new DataFrame
         users = pd.DataFrame(data=[" mArk ", "JOHN ", "Tim", " jenny"],
                              columns=["name"])
         users
Out[63]:      name
         0   mArk
         1   JOHN
         2     Tim
         3   jenny

In [64]: users_cleaned = users.loc[:, "name"].str.strip().str.capitalize()
         users_cleaned
Out[64]: 0     Mark
         1     John
         2      Tim
         3    Jenny
         Name: name, dtype: object

Or, to find all names that start with a "J":

In [65]: users_cleaned.str.startswith("J")
Out[65]: 0    False
         1     True
         2    False
         3     True
         Name: name, dtype: bool

The string methods are easy to use, but sometimes you may need to manipulate a DataFrame in a way that isn't built-in. In that case, create your own function and apply it to your DataFrame, as the next section shows.

Applying a Function

DataFrames offer the applymap method, which will apply a function to every individual element, something that is useful if there are no NumPy ufuncs available. For example, there are no ufuncs for string formatting, so we can format every element of a DataFrame like so:

In [66]: rainfall
Out[66]:    City 1  City 2  City 3
         0   300.1   400.3  1000.5
         1   100.2   300.4  1100.6

In [67]: def format_string(x):
             return f"{x:,.2f}"

In [68]: # Note that we pass in the function without calling it,
         # i.e., format_string and not format_string()!
         rainfall.applymap(format_string)
Out[68]:    City 1  City 2    City 3
         0  300.10  400.30  1,000.50
         1  100.20  300.40  1,100.60

To break this down: the following f-string returns x as a string: f"{x}". To add formatting, append a colon to the variable followed by the formatting string ,.2f. The comma is the thousands separator and .2f means fixed-point notation with two digits following the decimal point.
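To see such format specifiers in action, here are a few one-liners in plain Python, no pandas required (the percentage and zero-padding specs come from the same mini-language):

```python
# The format spec after the colon controls the rendering
print(f"{1000.5:,.2f}")   # thousands separator + 2 decimals: 1,000.50
print(f"{0.1234:.1%}")    # percentage with 1 decimal:        12.3%
print(f"{42:05d}")        # zero-padded integer:              00042
```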
To get more details about how to format strings, pleaserefer to the Format Specification Mini-Language, which is part of the Python docu‐mentation.For this sort of use case, lambda expressions (see sidebar) are widely used as theyallow you to write the same in a single line without having to define a separate func‐tion. With lambda expressions, we can rewrite the previous example as the following:In [69]: rainfall.applymap(lambda x: f"{x:,.2f}")Out[69]: City 1 City 2 City 3 0 300.10 400.30 1,000.50 1 100.20 300.40 1,100.60Lambda ExpressionsPython allows you to define a function in a single line via lambda expressions. Lambdaexpressions are anonymous functions, which means that it is a function without aname. Consider this function:def function_name(arg1, arg2, ...): return return_valueThis function can be rewritten as a lambda expression like this:lambda arg1, arg2, ...: return_valueIn essence, you replace def with lambda, leave away the return keyword and thefunction name, and put everything on one line. As we saw with the applymap method,this can be really convenient in this case as we don’t need to define a function forsomething that’s just being used a single time.106 | Chapter 5: Data Analysis with pandashttps://oreil.ly/NgsG8I have now mentioned all the important data manipulation methods, but before wemove on, it’s important to understand when pandas uses a view of a DataFrame andwhen it uses a copy.View vs. CopyYou may remember from the previous chapter that slicing NumPy arrays returns aview. With DataFrames, it’s unfortunately more complicated: it isn’t always easily pre‐dictable whether loc and iloc return views or copies, which makes it one of themore confusing topics. 
Since it’s a big difference whether you are changing the viewor a copy of a DataFrame, pandas raises the following warning regularly when itthinks that you are setting the data in an unintended way: SettingWithCopyWarning.To circumvent this rather cryptic warning, here is some advice:• Set values on the original DataFrame, not on a DataFrame that has been sliced offanother DataFrame• If you want to have an independent DataFrame after slicing, make an explicitcopy:selection = df.loc[:, ["country", "continent"]].copy()While things are complicated with loc and iloc, it’s worth remembering that allDataFrame methods such as df.dropna() or df.sort_values("column_name")always return a copy.So far, we’ve mostly worked with one DataFrame at a time. The next section showsyou various ways to combine multiple DataFrames into one, a very common task forwhich pandas offers powerful tools.Combining DataFramesCombining different datasets in Excel can be a cumbersome task and typicallyinvolves a lot of VLOOKUP formulas. Fortunately, combining DataFrames is one of pan‐das’ killer features where its data alignment capabilities will make your life really easy,thereby greatly reducing the possibility of introducing errors. Combining and merg‐ing DataFrames can be done in various ways; this section looks at just the most com‐mon cases using concat, join, and merge. While they have an overlap, each functionmakes a specific task very simple. I will start with the concat function, then explainthe different options with join, and conclude by introducing merge, the most genericfunction of the three.Combining DataFrames | 107ConcatenatingTo simply glue multiple DataFrames together, the concat function is your best friend.As you can tell by the name of the function, this process has the technical name con‐catenation. By default, concat glues DataFrames together along the rows and alignsthe columns automatically. 
In the following example, I create another DataFrame, more_users, and attach it to the bottom of our sample DataFrame df:

In [70]: data=[[15, "France", 4.1, "Becky"],
               [44, "Canada", 6.1, "Leanne"]]
         more_users = pd.DataFrame(data=data,
                                   columns=["age", "country", "score", "name"],
                                   index=[1000, 1011])
         more_users
Out[70]:       age country  score    name
         1000   15  France    4.1   Becky
         1011   44  Canada    6.1  Leanne

In [71]: pd.concat([df, more_users], axis=0)
Out[71]:         name  age  country  score continent
         1001    Mark   55    Italy    4.5    Europe
         1000    John   33      USA    6.7   America
         1002     Tim   41      USA    3.9   America
         1003   Jenny   12  Germany    9.0    Europe
         1000   Becky   15   France    4.1       NaN
         1011  Leanne   44   Canada    6.1       NaN

Note that you now have duplicate index elements, as concat glues the data together on the indicated axis (rows) and only aligns the data on the other one (columns), thereby matching the column names automatically—even if they are not in the same order in the two DataFrames! If you want to glue two DataFrames together along the columns, set axis=1:

In [72]: data=[[3, 4], [5, 6]]
         more_categories = pd.DataFrame(data=data,
                                        columns=["quizzes", "logins"],
                                        index=[1000, 2000])
         more_categories
Out[72]:       quizzes  logins
         1000        3       4
         2000        5       6

In [73]: pd.concat([df, more_categories], axis=1)
Out[73]:        name   age  country  score continent  quizzes  logins
         1000   John  33.0      USA    6.7   America      3.0     4.0
         1001   Mark  55.0    Italy    4.5    Europe      NaN     NaN
         1002    Tim  41.0      USA    3.9   America      NaN     NaN
         1003  Jenny  12.0  Germany    9.0    Europe      NaN     NaN
         2000    NaN   NaN      NaN    NaN       NaN      5.0     6.0

The special and very useful feature of concat is that it accepts more than two DataFrames.
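A minimal sketch of that feature, with three hypothetical single-column DataFrames; the keys argument, part of concat's standard signature, adds an outer index level that records which input each row came from:

```python
import pandas as pd

q1 = pd.DataFrame({"sales": [10, 20]})
q2 = pd.DataFrame({"sales": [30]})
q3 = pd.DataFrame({"sales": [40, 50]})

# concat accepts a list of any length; keys labels each block
combined = pd.concat([q1, q2, q3], keys=["Q1", "Q2", "Q3"])
print(combined)
# The result has a MultiIndex: ("Q1", 0), ("Q1", 1), ("Q2", 0), ...
assert len(combined) == 5
assert combined.loc["Q3", "sales"].tolist() == [40, 50]
```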
We will use this in the next chapter to make a single DataFrame out of multi‐ple CSV files:pd.concat([df1, df2, df3, ...])On the other hand, join and merge only work with two DataFrames, as we’ll see next.Joining and MergingWhen you join two DataFrames, you combine the columns of each DataFrame into anew DataFrame while deciding what happens with the rows by relying on set theory.If you have worked with relational databases before, it’s the same concept as the JOINclause in SQL queries. Figure 5-3 shows how the four join types (that is the inner, left,right, and outer join) work by using two sample DataFrames, df1 and df2.Figure 5-3. Join typesWith join, pandas uses the indices of both DataFrames to align the rows. An innerjoin returns a DataFrame with only those rows where the indices overlap. A left jointakes all the rows from the left DataFrame df1 and matches the rows from the rightDataFrame df2 on the index. Where df2 doesn’t have a matching row, pandas will fillin NaN. The left join corresponds to the VLOOKUP case in Excel. The right join takes allrows from the right table df2 and matches them with rows from df1 on the index.And finally, the outer join, which is short for full outer join, takes the union of indicesfrom both DataFrames and matches the values where it can. Table 5-5 is the equiva‐lent of Figure 5-3 in text form.Combining DataFrames | 109Table 5-5. 
Join typesType Descriptioninner Only rows whose index exists in both DataFramesleft All rows from the left DataFrame, matching rows from the right DataFrameright All rows from the right DataFrame, matching rows from the left DataFrameouter The union of row indices from both DataFramesLet’s see how this works in practice, bringing the examples from Figure 5-3 to life:In [74]: df1 = pd.DataFrame(data=[[1, 2], [3, 4], [5, 6]], columns=["A", "B"]) df1Out[74]: A B 0 1 2 1 3 4 2 5 6In [75]: df2 = pd.DataFrame(data=[[10, 20], [30, 40]], columns=["C", "D"], index=[1, 3]) df2Out[75]: C D 1 10 20 3 30 40In [76]: df1.join(df2, how="inner")Out[76]: A B C D 1 3 4 10 20In [77]: df1.join(df2, how="left")Out[77]: A B C D 0 1 2 NaN NaN 1 3 4 10.0 20.0 2 5 6 NaN NaNIn [78]: df1.join(df2, how="right")Out[78]: A B C D 1 3.0 4.0 10 20 3 NaN NaN 30 40In [79]: df1.join(df2, how="outer")Out[79]: A B C D 0 1.0 2.0 NaN NaN 1 3.0 4.0 10.0 20.0 2 5.0 6.0 NaN NaN 3 NaN NaN 30.0 40.0If you want to join on one or more DataFrame columns instead of relying on theindex, use merge instead of join. 
merge accepts the on argument to provide one or more columns as the join condition: these columns, which have to exist on both DataFrames, are used to match the rows:

In [80]: # Add a column called "category" to both DataFrames
         df1["category"] = ["a", "b", "c"]
         df2["category"] = ["c", "b"]

In [81]: df1
Out[81]:    A  B category
         0  1  2        a
         1  3  4        b
         2  5  6        c

In [82]: df2
Out[82]:     C   D category
         1  10  20        c
         3  30  40        b

In [83]: df1.merge(df2, how="inner", on=["category"])
Out[83]:    A  B category   C   D
         0  3  4        b  30  40
         1  5  6        c  10  20

In [84]: df1.merge(df2, how="left", on=["category"])
Out[84]:    A  B category     C     D
         0  1  2        a   NaN   NaN
         1  3  4        b  30.0  40.0
         2  5  6        c  10.0  20.0

Since join and merge accept quite a few optional arguments to accommodate more complex scenarios, I invite you to have a look at the official documentation to learn more about them.

You know now how to manipulate one or more DataFrames, which brings us to the next step in our data analysis journey: making sense of data.

Descriptive Statistics and Data Aggregation

One way to make sense of big datasets is to compute a descriptive statistic like the sum or the mean on either the whole dataset or on meaningful subsets. This section starts by looking at how this works with pandas before it introduces two ways to aggregate data into subsets: the groupby method and the pivot_table function.

Descriptive Statistics

Descriptive statistics allows you to summarize datasets by using quantitative measures. For example, the number of data points is a simple descriptive statistic. Averages like mean, median, or mode are other popular examples. DataFrames and Series allow you to access descriptive statistics conveniently via methods like sum, mean, and count, to name just a few. You will meet many of them throughout this book, and the full list is available via the pandas documentation.
By default, they return a Series along axis=0, which means you get the statistic of the columns:

In [85]: rainfall
Out[85]:    City 1  City 2  City 3
         0   300.1   400.3  1000.5
         1   100.2   300.4  1100.6

In [86]: rainfall.mean()
Out[86]: City 1     200.15
         City 2     350.35
         City 3    1050.55
         dtype: float64

If you want the statistic per row, provide the axis argument:

In [87]: rainfall.mean(axis=1)
Out[87]: 0    566.966667
         1    500.400000
         dtype: float64

By default, missing values are not included in descriptive statistics like sum or mean. This is in line with how Excel treats empty cells, so using Excel's AVERAGE formula on a range with empty cells will give you the same result as the mean method applied on a Series with the same numbers and NaN values instead of empty cells.

Getting a statistic across all rows of a DataFrame is sometimes not good enough and you need more granular information—the mean per category, for example. Let's see how it's done!

Grouping

Using our sample DataFrame df again, let's find out the average score per continent! To do this, you first group the rows by continent and subsequently apply the mean method, which will calculate the mean per group. All nonnumeric columns are automatically excluded:

In [88]: df.groupby(["continent"]).mean()
Out[88]: properties   age  score
         continent
         America     37.0   5.30
         Europe      33.5   6.75

If you include more than one column, the resulting DataFrame will have a hierarchical index—the MultiIndex we met earlier on:

In [89]: df.groupby(["continent", "country"]).mean()
Out[89]: properties          age  score
         continent country
         America   USA       37    5.3
         Europe    Germany   12    9.0
                   Italy     55    4.5

Instead of mean, you can use most of the descriptive statistics that pandas offers and if you want to use your own function, use the agg method.
For example, here is howyou get the difference between the maximum and minimum value per group:In [90]: df.groupby(["continent"]).agg(lambda x: x.max() - x.min())Out[90]: properties age score continent America 8 2.8 Europe 43 4.5A popular way to get statistics per group in Excel is to use pivot tables. They intro‐duce a second dimension and are great to look at your data from different perspec‐tives. pandas has a pivot table functionality, too, as we will see next.Pivoting and MeltingIf you are using pivot tables in Excel, you will have no trouble applying pandas’pivot_table function, as it works in largely the same way. The data in the followingDataFrame is organized similarly to how records are typically stored in a database;each row shows a sales transaction for a specific fruit in a certain region:In [91]: data = [["Oranges", "North", 12.30], ["Apples", "South", 10.55], ["Oranges", "South", 22.00], ["Bananas", "South", 5.90], ["Bananas", "North", 31.30], ["Oranges", "North", 13.10]] sales = pd.DataFrame(data=data, columns=["Fruit", "Region", "Revenue"]) salesOut[91]: Fruit Region Revenue 0 Oranges North 12.30 1 Apples South 10.55 2 Oranges South 22.00 3 Bananas South 5.90 4 Bananas North 31.30 5 Oranges North 13.10To create a pivot table, you provide the DataFrame as the first argument to thepivot_table function. index and columns define which column of the DataFramewill become the pivot table’s row and column labels, respectively. values are going tobe aggregated into the data part of the resulting DataFrame by using the aggfunc, aDescriptive Statistics and Data Aggregation | 113function that can be provided as a string or NumPy ufunc. 
And finally, margins cor‐respond to Grand Total in Excel, i.e., if you leave margins and margins_name away,the Total column and row won’t be shown:In [92]: pivot = pd.pivot_table(sales, index="Fruit", columns="Region", values="Revenue", aggfunc="sum", margins=True, margins_name="Total") pivotOut[92]: Region North South Total Fruit Apples NaN 10.55 10.55 Bananas 31.3 5.90 37.20 Oranges 25.4 22.00 47.40 Total 56.7 38.45 95.15In summary, pivoting your data means to take the unique values of a column (Regionin our case) and turn them into the column headers of the pivot table, thereby aggre‐gating the values from another column. This makes it easy to read off summaryinformation across the dimensions of interest. In our pivot table, you instantly seethat there were no apples sold in the north region and that in the south region, mostrevenues come from oranges. If you want to go the other way around and turn thecolumn headers into the values of a single column, use melt. In that sense, melt is theopposite of the pivot_table function:In [93]: pd.melt(pivot.iloc[:-1,:-1].reset_index(), id_vars="Fruit", value_vars=["North", "South"], value_name="Revenue")Out[93]: Fruit Region Revenue 0 Apples North NaN 1 Bananas North 31.30 2 Oranges North 25.40 3 Apples South 10.55 4 Bananas South 5.90 5 Oranges South 22.00Here, I am providing our pivot table as the input, but I am using iloc to get rid of thetotal row and column. I also reset the index so that all information is available as reg‐ular columns. I then provide id_vars to indicate the identifiers and value_vars todefine which columns I want to “unpivot.” Melting can be useful if you want to pre‐pare the data so it can be stored back to a database that expects it in this format.Working with aggregated statistics helps you understand your data, but nobody likesto read a page full of numbers. To make information easily understandable, nothingworks better than creating visualizations, which is our next topic. 
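Before we go there, here is a compact round-trip sketch that ties pivot_table and melt together; the data and column names below are illustrative, not the chapter's sales example:

```python
import pandas as pd

tx = pd.DataFrame({"Fruit": ["Oranges", "Apples", "Oranges"],
                   "Region": ["North", "South", "South"],
                   "Revenue": [12.3, 10.55, 22.0]})

# Long -> wide: unique Region values become column headers
wide = pd.pivot_table(tx, index="Fruit", columns="Region",
                      values="Revenue", aggfunc="sum")
assert wide.loc["Oranges", "South"] == 22.0

# Wide -> long again: column headers fold back into a Region column
long_again = pd.melt(wide.reset_index(), id_vars="Fruit",
                     value_vars=["North", "South"], value_name="Revenue")
assert len(long_again) == 4  # 2 fruits x 2 regions, incl. NaN combinations
```

Melting the pivot gives back one row per fruit/region combination, including the combinations that had no sales (shown as NaN).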
While Excel uses the term charts, pandas generally refers to them as plots. I will use these terms interchangeably in this book.

Plotting

Plotting allows you to visualize the findings of your data analysis and may well be the most important step in the whole process. For plotting, we're going to use two libraries: we start by looking at Matplotlib, pandas' default plotting library, before we focus on Plotly, a modern plotting library that gives us a more interactive experience in Jupyter notebooks.

Matplotlib

Matplotlib is a plotting package that has been around for a long time and is included in the Anaconda distribution. With it, you can generate plots in a variety of formats, including vector graphics for high-quality printing. When you call the plot method of a DataFrame, pandas will produce a Matplotlib plot by default.

To use Matplotlib in a Jupyter notebook, you need to first run one of two magic commands (see the sidebar "Magic Commands" on page 116): %matplotlib inline or %matplotlib notebook. They configure the notebook so that plots can be displayed in the notebook itself. The latter command adds a bit more interactivity, allowing you to change the size or zoom factor of the chart. Let's get started and create a first plot with pandas and Matplotlib (see Figure 5-4):

In [94]: import numpy as np
         %matplotlib inline  # Or %matplotlib notebook

In [95]: data = pd.DataFrame(data=np.random.rand(4, 4) * 100000,
                             index=["Q1", "Q2", "Q3", "Q4"],
                             columns=["East", "West", "North", "South"])
         data.index.name = "Quarters"
         data.columns.name = "Region"
         data
Out[95]: Region          East          West         North         South
         Quarters
         Q1      23254.220271  96398.309860  16845.951895  41671.684909
         Q2      87316.022433  45183.397951  15460.819455  50951.465770
         Q3      51458.760432   3821.139360  77793.393899  98915.952421
         Q4      64933.848496   7600.277035  55001.831706  86248.512650

In [96]: data.plot()  # Shortcut for data.plot.line()

Figure 5-4. Matplotlib plot

Note that in this example, I have used a NumPy array to construct a pandas DataFrame. Providing NumPy arrays allows you to leverage NumPy's constructors that we met in the last chapter; here, we use NumPy to generate a pandas DataFrame based on pseudorandom numbers. Therefore, when you run the sample on your end, you will get different values.

Magic Commands

The %matplotlib inline command we used to make Matplotlib work properly with Jupyter notebooks is a magic command. Magic commands are a set of simple commands that cause a Jupyter notebook cell to behave in a certain way or make cumbersome tasks so easy that it almost feels like magic. You write magic commands in cells like Python code, but they either start with %% or %. Commands that affect the whole cell start with %%, and commands that only affect a single line in a cell start with %. We will see more magic commands in the next chapters, but if you want to list all currently available magic commands, run %lsmagic, and for a detailed description, run %magic.

Even if you use the magic command %matplotlib notebook, you will probably notice that Matplotlib was originally designed for static plots rather than for an interactive experience on a web page. That's why we're going to use Plotly next, a plotting library designed for the web.

Plotly

Plotly is a JavaScript-based library and can—since version 4.8.0—be used as a pandas plotting backend with great interactivity: you can easily zoom in, click on the legend to select or deselect a category, and get tooltips with more info about the data point you're hovering over. Plotly is not included in the Anaconda installation, so if you haven't installed it yet, do so now by running the following command:

(base)> conda install plotly

Once you run the following cell, the plotting backend of the whole notebook will be set to Plotly, and if you would rerun the previous cell, it would also be rendered as a Plotly chart.
For Plotly, instead of running a magic command, you just need to set it as backend before being able to plot Figures 5-5 and 5-6:

In [97]: # Set the plotting backend to Plotly
         pd.options.plotting.backend = "plotly"

In [98]: data.plot()

Figure 5-5. Plotly line plot

In [99]: # Display the same data as bar plot
         data.plot.bar(barmode="group")

Figure 5-6. Plotly bar plot

Differences in Plotting Backends

If you use Plotly as plotting backend, you'll need to check the accepted arguments of the plot methods directly on the Plotly docs. For example, you can take a look at the barmode="group" argument in Plotly's bar charts documentation.

pandas and the underlying plotting libraries offer a wealth of chart types and options to format the charts in almost any desired way. It's also possible to arrange multiple plots into a series of subplots. As an overview, Table 5-6 shows the available plot types.

Table 5-6. pandas plot types

Type     Description
line     Line chart, default when running df.plot()
bar      Vertical bar chart
barh     Horizontal bar chart
hist     Histogram
box      Box plot
kde      Density plot, can also be used via density
area     Area chart
scatter  Scatter plot
hexbin   Hexagonal bin plot
pie      Pie chart

On top of that, pandas offers some higher-level plotting tools and techniques that are made up of multiple individual components. For details, see the pandas visualization documentation.

Other Plotting Libraries

The scientific visualization landscape in Python is very active, and besides Matplotlib and Plotly, there are many other high-quality options to choose from that may be the better option for certain use cases:

Seaborn
Seaborn is built on top of Matplotlib.
It improves the default style and adds additional plots like heatmaps, which often simplify your work: you can create advanced statistical plots with only a few lines of code.

Bokeh (https://docs.bokeh.org)
Bokeh is similar to Plotly in technology and functionality: it's based on JavaScript and therefore also works great for interactive charts in Jupyter notebooks. Bokeh is included in Anaconda.

Altair (https://oreil.ly/t06t7)
Altair is a library for statistical visualizations based on the Vega project (https://oreil.ly/RN6A7). Altair is also JavaScript-based and offers some interactivity like zooming.

HoloViews (https://holoviews.org)
HoloViews is another JavaScript-based package that focuses on making data analysis and visualization easy. With a few lines of code, you can achieve complex statistical plots.

We will create more plots in the next chapter to analyze time series, but before we get there, let's wrap this chapter up by learning how we can import and export data with pandas!

Importing and Exporting DataFrames

So far, we constructed DataFrames from scratch using nested lists, dictionaries, or NumPy arrays. These techniques are important to know, but typically, the data is already available and you simply need to turn it into a DataFrame. To do this, pandas offers various reader functions. But even if you need to access a proprietary system for which pandas doesn't offer a built-in reader, you often have a Python package to connect to that system, and once you have the data, it's easy enough to turn it into a DataFrame. In Excel, data import is the type of work you would usually handle with Power Query.

After analyzing and changing your dataset, you might want to push the results back into a database or export it to a CSV file or—given the title of the book—present it in an Excel workbook to your manager. To export pandas DataFrames, use one of the exporter methods that DataFrames offer.
Table 5-7 shows an overview of the most common import and export methods.

Table 5-7. Importing and exporting DataFrames

Data format/system  Import: pandas (pd) function  Export: DataFrame (df) method
CSV files           pd.read_csv                   df.to_csv
JSON                pd.read_json                  df.to_json
HTML                pd.read_html                  df.to_html
Clipboard           pd.read_clipboard             df.to_clipboard
Excel files         pd.read_excel                 df.to_excel
SQL databases       pd.read_sql                   df.to_sql

We will meet pd.read_sql and df.to_sql in Chapter 11, where we will use them as part of a case study. And since I am going to dedicate the whole of Chapter 7 to the topic of reading and writing Excel files with pandas, I will focus on importing and exporting CSV files in this section. Let's start with exporting an existing DataFrame!

Exporting CSV Files

If you need to pass a DataFrame to a colleague who might not use Python or pandas, passing it in the form of a CSV file is usually a good idea: pretty much every program knows how to import them. To export our sample DataFrame df to a CSV file, use the to_csv method:

In [100]: df.to_csv("course_participants.csv")

If you wanted to store the file in a different directory, supply the full path as a raw string, e.g., r"C:\path\to\desired\location\msft.csv".

Use Raw Strings for File Paths on Windows

In strings, the backslash is used to escape certain characters. That's why to work with file paths on Windows, you either need to use double backslashes (C:\\path\\to\\file.csv) or prefix the string with an r to turn it into a raw string that interprets the characters literally. This isn't an issue on macOS or Linux, as they use forward slashes in paths.

By providing only the file name as I do, it will produce the file course_participants.csv in the same directory as the notebook with the following content:

user_id,name,age,country,score,continent
1001,Mark,55,Italy,4.5,Europe
1000,John,33,USA,6.7,America
1002,Tim,41,USA,3.9,America
1003,Jenny,12,Germany,9.0,Europe

Now that you know how to use the df.to_csv method, let's see how importing a CSV file works!

Importing CSV Files

Importing a local CSV file is as easy as providing its path to the read_csv function. MSFT.csv is a CSV file that I downloaded from Yahoo! Finance and it contains the daily historical stock prices for Microsoft—you'll find it in the companion repository, in the csv folder:

In [101]: msft = pd.read_csv("csv/MSFT.csv")

Often, you will need to supply a few more parameters to read_csv than just the file name. For example, sep allows you to tell pandas what separator or delimiter the CSV file uses in case it isn't the default comma. We will use a few more parameters in the next chapter, but for the full overview, have a look at the pandas documentation.

Now that we are dealing with big DataFrames with many thousands of rows, typically the first thing is to run the info method to get a summary of the DataFrame. Next, you may want to take a peek at the first and last few rows of the DataFrame using the head and tail methods. These methods return five rows by default, but this can be changed by providing the desired number of rows as an argument.
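Putting to_csv and read_csv together, a minimal round trip looks like the following sketch. Writing to an in-memory StringIO buffer instead of a file on disk is my addition here, so the example leaves no file behind:

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame({"user_id": [1001, 1000],
                   "name": ["Mark", "John"],
                   "score": [4.5, 6.7]})

# index=False drops the index column that to_csv writes by default
buffer = StringIO()
df.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV back into a new DataFrame
buffer.seek(0)
roundtrip = pd.read_csv(buffer)
```

Because read_csv infers the same dtypes here (int64, object, float64), the round-tripped DataFrame is equal to the original.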
You can also run the describe method to get some basic statistics:

In [102]: msft.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8622 entries, 0 to 8621
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Date       8622 non-null   object
 1   Open       8622 non-null   float64
 2   High       8622 non-null   float64
 3   Low        8622 non-null   float64
 4   Close      8622 non-null   float64
 5   Adj Close  8622 non-null   float64
 6   Volume     8622 non-null   int64
dtypes: float64(5), int64(1), object(1)
memory usage: 471.6+ KB

In [103]: # I am selecting a few columns because of space issues
          # You can also just run: msft.head()
          msft.loc[:, ["Date", "Adj Close", "Volume"]].head()
Out[103]:          Date  Adj Close      Volume
          0  1986-03-13   0.062205  1031788800
          1  1986-03-14   0.064427   308160000
          2  1986-03-17   0.065537   133171200
          3  1986-03-18   0.063871    67766400
          4  1986-03-19   0.062760    47894400

In [104]: msft.loc[:, ["Date", "Adj Close", "Volume"]].tail(2)
Out[104]:             Date   Adj Close    Volume
          8620  2020-05-26  181.570007  36073600
          8621  2020-05-27  181.809998  39492600

In [105]: msft.loc[:, ["Adj Close", "Volume"]].describe()
Out[105]:          Adj Close        Volume
          count  8622.000000  8.622000e+03
          mean     24.921952  6.030722e+07
          std      31.838096  3.877805e+07
          min       0.057762  2.304000e+06
          25%       2.247503  3.651632e+07
          50%      18.454313  5.350380e+07
          75%      25.699224  7.397560e+07
          max     187.663330  1.031789e+09

Adj Close stands for adjusted close price and corrects the stock price for corporate actions such as stock splits. Volume is the number of stocks that were traded. I have summarized the various DataFrame exploration methods we've seen in this chapter in Table 5-8.

Table 5-8. DataFrame exploration methods and attributes

DataFrame (df) method/attribute  Description
df.info()                        Provides number of data points, index type, dtype, and memory usage.
df.describe()                    Provides basic statistics including count, mean, std, min, max, and percentiles.
df.head(n=5)                     Returns the first n rows of the DataFrame.
df.tail(n=5)                     Returns the last n rows of the DataFrame.
df.dtypes                        Returns the dtype of each column.

The read_csv function also accepts a URL instead of a local CSV file. This is how you read the CSV file directly from the companion repo:

In [106]: # The line break in the URL is only to make it fit on the page
          url = ("https://raw.githubusercontent.com/fzumstein/"
                 "python-for-excel/1st-edition/csv/MSFT.csv")
          msft = pd.read_csv(url)

In [107]: msft.loc[:, ["Date", "Adj Close", "Volume"]].head(2)
Out[107]:          Date  Adj Close      Volume
          0  1986-03-13   0.062205  1031788800
          1  1986-03-14   0.064427   308160000

We'll continue with this dataset and the read_csv function in the next chapter about time series, where we will turn the Date column into a DatetimeIndex.

Conclusion

This chapter was packed with new concepts and tools to analyze datasets in pandas. We learned how to load CSV files, how to deal with missing or duplicate data, and how to make use of descriptive statistics. We also saw how easy it is to turn your DataFrames into interactive plots. While it may take a while to digest everything, it probably won't take long before you will understand the immense power you are gaining by adding pandas to your tool belt.
Along the way, we compared pandas to the following Excel functionality:

AutoFilter functionality
See "Selecting by boolean indexing" on page 94.

VLOOKUP formula
See "Joining and Merging" on page 109.

Pivot Table
See "Pivoting and Melting" on page 113.

Power Query
This is a combination of "Importing and Exporting DataFrames" on page 119, "Data Manipulation" on page 91, and "Combining DataFrames" on page 107.

The next chapter is about time series analysis, the functionality that led to broad adoption of pandas by the financial industry. Let's see why this part of pandas has such an edge over Excel!

CHAPTER 6
Time Series Analysis with pandas

A time series is a series of data points along a time-based axis that plays a central role in many different scenarios: while traders use historical stock prices to calculate risk measures, the weather forecast is based on time series generated by sensors measuring temperature, humidity, and air pressure. And the digital marketing department relies on time series generated by web pages, e.g., the source and number of page views per hour, and will use them to draw conclusions with regard to their marketing campaigns.

Time series analysis is one of the main driving forces why data scientists and analysts have started to look for a better alternative to Excel. The following points summarize some of the reasons behind this move:

Big datasets
Time series can quickly grow beyond Excel's limit of roughly one million rows per sheet. For example, if you work with intraday stock prices on a tick data level, you're often dealing with hundreds of thousands of records—per stock and day!

Date and time
As we have seen in Chapter 3, Excel has various limitations when it comes to handling date and time, the backbone of time series. Missing support for time zones and a number format that is limited to milliseconds are some of them.
pandas supports time zones and uses NumPy's datetime64[ns] data type, which offers a resolution of up to nanoseconds.

Missing functionality
Excel misses even basic tools to be able to work with time series data in a decent way. For example, if you want to turn a daily time series into a monthly time series, there is no easy way of doing this despite it being a very common task.

DataFrames allow you to work with various time-based indices: DatetimeIndex is the most common one and represents an index with timestamps. Other index types, like PeriodIndex, are based on time intervals such as hours or months. In this chapter, however, we are only looking at DatetimeIndex, which I will introduce now in more detail.

DatetimeIndex

In this section, we'll learn how to construct a DatetimeIndex, how to filter such an index to a specific time range, and how to work with time zones.

Creating a DatetimeIndex

To construct a DatetimeIndex, pandas offers the date_range function. It accepts a start date, a frequency, and either the number of periods or the end date:

In [1]: # Let's start by importing the packages we use in this chapter
        # and by setting the plotting backend to Plotly
        import pandas as pd
        import numpy as np
        pd.options.plotting.backend = "plotly"

In [2]: # This creates a DatetimeIndex based on a start timestamp,
        # number of periods and frequency ("D" = daily).
        daily_index = pd.date_range("2020-02-28", periods=4, freq="D")
        daily_index
Out[2]: DatetimeIndex(['2020-02-28', '2020-02-29', '2020-03-01', '2020-03-02'],
                      dtype='datetime64[ns]', freq='D')

In [3]: # This creates a DatetimeIndex based on start/end timestamp.
        # The frequency is set to "weekly on Sundays" ("W-SUN").
        weekly_index = pd.date_range("2020-01-01", "2020-01-31", freq="W-SUN")
        weekly_index
Out[3]: DatetimeIndex(['2020-01-05', '2020-01-12', '2020-01-19', '2020-01-26'],
                      dtype='datetime64[ns]', freq='W-SUN')

In [4]: # Construct a DataFrame based on the weekly_index. This could be
        # the visitor count of a museum that only opens on Sundays.
        pd.DataFrame(data=[21, 15, 33, 34],
                     columns=["visitors"], index=weekly_index)
Out[4]:             visitors
        2020-01-05        21
        2020-01-12        15
        2020-01-19        33
        2020-01-26        34

Let's now return to the Microsoft stock time series from the last chapter. When you take a closer look at the data types of the columns, you will notice that the Date column has the type object, which means that pandas has interpreted the timestamps as strings:

In [5]: msft = pd.read_csv("csv/MSFT.csv")

In [6]: msft.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8622 entries, 0 to 8621
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Date       8622 non-null   object
 1   Open       8622 non-null   float64
 2   High       8622 non-null   float64
 3   Low        8622 non-null   float64
 4   Close      8622 non-null   float64
 5   Adj Close  8622 non-null   float64
 6   Volume     8622 non-null   int64
dtypes: float64(5), int64(1), object(1)
memory usage: 471.6+ KB

There are two ways to fix this and turn it into a datetime data type. The first one is to run the to_datetime function on that column. Make sure to assign the transformed column back to the original DataFrame if you want to change it at the source:

In [7]: msft.loc[:, "Date"] = pd.to_datetime(msft["Date"])

In [8]: msft.dtypes
Out[8]: Date         datetime64[ns]
        Open                float64
        High                float64
        Low                 float64
        Close               float64
        Adj Close           float64
        Volume                int64
        dtype: object

The other possibility is to tell read_csv about the columns that contain timestamps by using the parse_dates argument. parse_dates expects a list of column names or indices. Also, you almost always want to turn timestamps into the index of the DataFrame since this will allow you to filter the data easily, as we will see in a moment.
To spare yourself an extra set_index call, provide the column you would like to use as index via the index_col argument, again as column name or index:

In [9]: msft = pd.read_csv("csv/MSFT.csv",
                           index_col="Date", parse_dates=["Date"])

In [10]: msft.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 8622 entries, 1986-03-13 to 2020-05-27
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Open       8622 non-null   float64
 1   High       8622 non-null   float64
 2   Low        8622 non-null   float64
 3   Close      8622 non-null   float64
 4   Adj Close  8622 non-null   float64
 5   Volume     8622 non-null   int64
dtypes: float64(5), int64(1)
memory usage: 471.5 KB

As info reveals, you are now dealing with a DataFrame that has a DatetimeIndex. If you would need to change another data type (let's say you wanted Volume to be a float instead of an int), you again have two options: either provide dtype={"Volume": float} as argument to the read_csv function, or apply the astype method as follows:

In [11]: msft.loc[:, "Volume"] = msft["Volume"].astype("float")
         msft["Volume"].dtype
Out[11]: dtype('float64')

With time series, it's always a good idea to make sure the index is sorted properly before starting your analysis:

In [12]: msft = msft.sort_index()

And finally, if you need to access only parts of a DatetimeIndex, like the date part without the time, access the date attribute like this:

In [13]: msft.index.date
Out[13]: array([datetime.date(1986, 3, 13), datetime.date(1986, 3, 14),
                datetime.date(1986, 3, 17), ..., datetime.date(2020, 5, 22),
                datetime.date(2020, 5, 26), datetime.date(2020, 5, 27)],
               dtype=object)

Instead of date, you can also use parts of a date like year, month, day, etc.
To access the same functionality on a regular column with data type datetime, you will have to use the dt attribute, e.g., df["column_name"].dt.date.

With a sorted DatetimeIndex, let's see how we can filter the DataFrame to certain time periods!

Filtering a DatetimeIndex

If your DataFrame has a DatetimeIndex, there is an easy way to select rows from a specific time period by using loc with a string in the format YYYY-MM-DD HH:MM:SS. pandas will turn this string into a slice so it covers the whole period. For example, to select all rows from 2019, provide the year as a string, not a number:

In [14]: msft.loc["2019", "Adj Close"]
Out[14]: Date
         2019-01-02     99.099190
         2019-01-03     95.453529
         2019-01-04     99.893005
         2019-01-07    100.020401
         2019-01-08    100.745613
                          ...
         2019-12-24    156.515396
         2019-12-26    157.798309
         2019-12-27    158.086731
         2019-12-30    156.724243
         2019-12-31    156.833633
         Name: Adj Close, Length: 252, dtype: float64

Let's take this a step further and plot the data between June 2019 and May 2020 (see Figure 6-1):

In [15]: msft.loc["2019-06":"2020-05", "Adj Close"].plot()

Figure 6-1. Adjusted close price for MSFT

Hover over the Plotly chart to read off the value as a tooltip and zoom in by drawing a rectangle with your mouse. Double-click the chart to get back to the default view. We'll use the adjusted close price in the next section to learn about time zone handling.

Working with Time Zones

Microsoft is listed on the Nasdaq stock exchange. The Nasdaq is in New York and markets close at 4:00 p.m. To add this additional information to the DataFrame's index, first add the closing hour to the date via DateOffset, then attach the correct time zone to the timestamps via tz_localize. Since the closing hour is only applicable to the close price, let's create a new DataFrame with it:

In [16]: # Add the time information to the date
         msft_close = msft.loc[:, ["Adj Close"]].copy()
         msft_close.index = msft_close.index + pd.DateOffset(hours=16)
         msft_close.head(2)
Out[16]:                      Adj Close
         Date
         1986-03-13 16:00:00   0.062205
         1986-03-14 16:00:00   0.064427

In [17]: # Make the timestamps time-zone-aware
         msft_close = msft_close.tz_localize("America/New_York")
         msft_close.head(2)
Out[17]:                            Adj Close
         Date
         1986-03-13 16:00:00-05:00   0.062205
         1986-03-14 16:00:00-05:00   0.064427

If you want to convert the timestamps to UTC time zone, use the DataFrame method tz_convert. UTC stands for Coordinated Universal Time and is the successor of Greenwich Mean Time (GMT). Note how the closing hours change in UTC depending on whether daylight saving time (DST) is in effect or not in New York:

In [18]: msft_close = msft_close.tz_convert("UTC")
         msft_close.loc["2020-01-02", "Adj Close"]  # 21:00 without DST
Out[18]: Date
         2020-01-02 21:00:00+00:00    159.737595
         Name: Adj Close, dtype: float64

In [19]: msft_close.loc["2020-05-01", "Adj Close"]  # 20:00 with DST
Out[19]: Date
         2020-05-01 20:00:00+00:00    174.085175
         Name: Adj Close, dtype: float64

Preparing time series like this will allow you to compare close prices from stock exchanges across different time zones even if the time info is missing or stated in the local time zone.

Now that you know what a DatetimeIndex is, let's try out a few common time series manipulations in the next section by calculating and comparing stock performance.

Common Time Series Manipulations

In this section, I'll show you how to perform common time series analysis tasks such as calculating stock returns, plotting the performance of various stocks, and visualizing the correlation of their returns in a heatmap.
We'll also see how to change the frequency of time series and how to calculate rolling statistics.

Shifting and Percentage Changes

In finance, the log returns of stocks are often assumed to be normally distributed. By log returns, I mean the natural logarithm of the ratio of the current and previous price. To get a feeling for the distribution of the daily log returns, let's plot a histogram. First, however, we need to calculate the log returns. In Excel, it's typically done with a formula that involves cells from two rows, as shown in Figure 6-2.

Figure 6-2. Calculating log returns in Excel

Logarithms in Excel and Python

Excel uses LN to denote the natural logarithm and LOG for the logarithm with base 10. Python's math module and NumPy, however, use log for the natural logarithm and log10 for the logarithm with base 10.

With pandas, rather than having a formula accessing two different rows, you use the shift method to shift the values down by one row. This allows you to operate on a single row so your calculations can make use of vectorization. shift accepts a positive or negative integer that shifts the time series down or up by the respective number of rows. Let's first see how shift works:

In [20]: msft_close.head()
Out[20]:                            Adj Close
         Date
         1986-03-13 21:00:00+00:00   0.062205
         1986-03-14 21:00:00+00:00   0.064427
         1986-03-17 21:00:00+00:00   0.065537
         1986-03-18 21:00:00+00:00   0.063871
         1986-03-19 21:00:00+00:00   0.062760

In [21]: msft_close.shift(1).head()
Out[21]:                            Adj Close
         Date
         1986-03-13 21:00:00+00:00        NaN
         1986-03-14 21:00:00+00:00   0.062205
         1986-03-17 21:00:00+00:00   0.064427
         1986-03-18 21:00:00+00:00   0.065537
         1986-03-19 21:00:00+00:00   0.063871

You are now able to write a single vector-based formula that is easy to read and understand. To get the natural logarithm, use NumPy's log ufunc, which is applied to each element.
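The shift-based log return formula can be verified on a tiny price series. The numbers below are made up, and the last line uses the identity log return = log(1 + simple return) as a cross-check:

```python
import pandas as pd
import numpy as np

prices = pd.Series([100.0, 102.0, 101.0, 105.0])

# Log returns: natural log of the ratio of current and previous price
log_rets = np.log(prices / prices.shift(1))

# Simple returns via pct_change, for comparison
simple_rets = prices.pct_change()

# Cross-check: log return equals log(1 + simple return)
recovered = np.log(1 + simple_rets)
```

The first element of both return series is NaN because there is no previous price to compare against.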
Then we can plot a histogram (see Figure 6-3):

In [22]: returns = np.log(msft_close / msft_close.shift(1))
         returns = returns.rename(columns={"Adj Close": "returns"})
         returns.head()
Out[22]:                             returns
         Date
         1986-03-13 21:00:00+00:00       NaN
         1986-03-14 21:00:00+00:00  0.035097
         1986-03-17 21:00:00+00:00  0.017082
         1986-03-18 21:00:00+00:00 -0.025749
         1986-03-19 21:00:00+00:00 -0.017547

In [23]: # Plot a histogram with the daily log returns
         returns.plot.hist()

Figure 6-3. Histogram plot

To get simple returns instead, use pandas' built-in pct_change method. By default, it calculates the percentage change from the previous row, which is also the definition of simple returns:

In [24]: simple_rets = msft_close.pct_change()
         simple_rets = simple_rets.rename(columns={"Adj Close": "simple rets"})
         simple_rets.head()
Out[24]:                            simple rets
         Date
         1986-03-13 21:00:00+00:00          NaN
         1986-03-14 21:00:00+00:00     0.035721
         1986-03-17 21:00:00+00:00     0.017229
         1986-03-18 21:00:00+00:00    -0.025421
         1986-03-19 21:00:00+00:00    -0.017394

So far, we have looked at just the Microsoft stock. In the next section, we're going to load more time series so we can have a look at other DataFrame methods that require multiple time series.

Rebasing and Correlation

Things get slightly more interesting when we work with more than one time series. Let's load a few additional adjusted close prices for Amazon (AMZN), Google (GOOGL), and Apple (AAPL), also downloaded from Yahoo!
Finance:

In [25]: parts = []  # List to collect individual DataFrames
         for ticker in ["AAPL", "AMZN", "GOOGL", "MSFT"]:
             # "usecols" allows us to only read in the Date and Adj Close
             adj_close = pd.read_csv(f"csv/{ticker}.csv",
                                     index_col="Date", parse_dates=["Date"],
                                     usecols=["Date", "Adj Close"])
             # Rename the column into the ticker symbol
             adj_close = adj_close.rename(columns={"Adj Close": ticker})
             # Append the stock's DataFrame to the parts list
             parts.append(adj_close)

In [26]: # Combine the 4 DataFrames into a single DataFrame
         adj_close = pd.concat(parts, axis=1)
         adj_close
Out[26]:                   AAPL         AMZN        GOOGL        MSFT
         Date
         1980-12-12    0.405683          NaN          NaN         NaN
         1980-12-15    0.384517          NaN          NaN         NaN
         1980-12-16    0.356296          NaN          NaN         NaN
         1980-12-17    0.365115          NaN          NaN         NaN
         1980-12-18    0.375698          NaN          NaN         NaN
         ...                ...          ...          ...         ...
         2020-05-22  318.890015  2436.879883  1413.239990  183.509995
         2020-05-26  316.730011  2421.860107  1421.369995  181.570007
         2020-05-27  318.109985  2410.389893  1420.280029  181.809998
         2020-05-28  318.250000  2401.100098  1418.239990         NaN
         2020-05-29  317.940002  2442.370117  1433.520020         NaN

         [9950 rows x 4 columns]

Did you see the power of concat? pandas has automatically aligned the individual time series along the dates. This is why you get NaN values for those stocks that don't go back as far as Apple. And since MSFT has NaN values at the most recent dates, you may have guessed that I downloaded MSFT.csv two days before the other ones. Aligning time series by date is a typical operation that is very cumbersome to do with Excel and therefore also very error-prone.
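The automatic alignment can be seen on a minimal example with two hand-made series whose dates only partially overlap:

```python
import pandas as pd

a = pd.Series([1.0, 2.0],
              index=pd.to_datetime(["2020-01-01", "2020-01-02"]), name="A")
b = pd.Series([10.0, 20.0],
              index=pd.to_datetime(["2020-01-02", "2020-01-03"]), name="B")

# concat aligns on the index; dates missing in one series become NaN
combined = pd.concat([a, b], axis=1)
```

The combined DataFrame has three rows, with NaN where one of the series has no observation, which is exactly what happens with the stock data above.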
Dropping all rows that contain missing values will make sure that all stocks have the same amount of data points:

In [27]: adj_close = adj_close.dropna()
         adj_close.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3970 entries, 2004-08-19 to 2020-05-27
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   AAPL    3970 non-null   float64
 1   AMZN    3970 non-null   float64
 2   GOOGL   3970 non-null   float64
 3   MSFT    3970 non-null   float64
dtypes: float64(4)
memory usage: 155.1 KB

Let's now rebase the prices so that all time series start at 100. This allows us to compare their relative performance in a chart; see Figure 6-4. To rebase a time series, divide every value by its starting value and multiply by 100, the new base. If you did this in Excel, you would typically write a formula with a combination of absolute and relative cell references, then copy the formula for every row and every time series. In pandas, thanks to vectorization and broadcasting, you are dealing with a single formula:

In [28]: # Use a sample from June 2019 - May 2020
         adj_close_sample = adj_close.loc["2019-06":"2020-05", :]
         rebased_prices = adj_close_sample / adj_close_sample.iloc[0, :] * 100
         rebased_prices.head(2)
Out[28]:                   AAPL        AMZN      GOOGL        MSFT
         Date
         2019-06-03  100.000000  100.000000  100.00000  100.000000
         2019-06-04  103.658406  102.178197  101.51626  102.770372

In [29]: rebased_prices.plot()

Figure 6-4. Rebased time series

To see how independent the returns of the different stocks are, have a look at their correlations by using the corr method.
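Before moving on to correlations, the rebasing formula can be checked on a small hand-made example; broadcasting divides every row by the first row:

```python
import pandas as pd

prices = pd.DataFrame({"A": [50.0, 55.0, 60.0],
                       "B": [200.0, 190.0, 210.0]})

# Divide every value by its starting value and multiply by 100
rebased = prices / prices.iloc[0, :] * 100
```

By construction, the first row of the rebased DataFrame is 100 for every column, so the later rows can be read directly as "percent of the starting value."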
Unfortunately, pandas doesn't provide a built-in plot type to visualize the correlation matrix as a heatmap, so we need to use Plotly directly via its plotly.express interface (see Figure 6-5):

In [30]: # Correlation of daily log returns
         returns = np.log(adj_close / adj_close.shift(1))
         returns.corr()
Out[30]:            AAPL      AMZN     GOOGL      MSFT
         AAPL   1.000000  0.424910  0.503497  0.486065
         AMZN   0.424910  1.000000  0.486690  0.485725
         GOOGL  0.503497  0.486690  1.000000  0.525645
         MSFT   0.486065  0.485725  0.525645  1.000000

In [31]: import plotly.express as px

In [32]: fig = px.imshow(returns.corr(),
                         x=adj_close.columns, y=adj_close.columns,
                         color_continuous_scale=list(
                             reversed(px.colors.sequential.RdBu)),
                         zmin=-1, zmax=1)
         fig.show()

If you want to understand how imshow works in detail, have a look at the Plotly Express API docs (https://oreil.ly/O86li).

Figure 6-5. Correlation heatmap

At this point, we have already learned quite a few things about time series, including how to combine and clean them and how to calculate returns and correlations. But what if you decide that daily returns are not a good base for your analysis and you want monthly returns? How you change the frequency of time series data is the topic of the next section.

Resampling

A regular task with time series is up- and downsampling. Upsampling means that the time series is converted into one with a higher frequency, and downsampling means that it is converted into one with a lower frequency. On financial factsheets, you often show monthly or quarterly performance, for example. To turn the daily time series into a monthly one, use the resample method that accepts a frequency string like M for end-of-calendar-month or BM for end-of-business-month. You can find a list of all frequency strings in the pandas docs. Similar to how groupby works, you then chain a method that defines how you are resampling.
I am using last to always take the last observation of that month:

In [33]: end_of_month = adj_close.resample("M").last()
         end_of_month.head()

Out[33]:                 AAPL       AMZN      GOOGL       MSFT
         Date
         2004-08-31  2.132708  38.139999  51.236237  17.673630
         2004-09-30  2.396127  40.860001  64.864868  17.900215
         2004-10-31  3.240182  34.130001  95.415413  18.107374
         2004-11-30  4.146072  39.680000  91.081078  19.344421
         2004-12-31  3.982207  44.290001  96.491493  19.279480

Instead of last, you can choose any other method that works on groupby, like sum or mean. There is also ohlc, which conveniently returns the open, high, low, and close values over that period. This may serve as the source to create the typical candlestick charts that are often used with stock prices.

If that end-of-month time series were all you had and you needed to produce a weekly time series out of it, you would have to upsample your time series. By using asfreq, you are telling pandas not to apply any transformation and hence you will see most of the values showing NaN. If you wanted to forward-fill the last known value instead, use the ffill method:

In [34]: end_of_month.resample("D").asfreq().head()  # No transformation

Out[34]:                 AAPL       AMZN      GOOGL      MSFT
         Date
         2004-08-31  2.132708  38.139999  51.236237  17.67363
         2004-09-01       NaN        NaN        NaN       NaN
         2004-09-02       NaN        NaN        NaN       NaN
         2004-09-03       NaN        NaN        NaN       NaN
         2004-09-04       NaN        NaN        NaN       NaN

In [35]: end_of_month.resample("W-FRI").ffill().head()  # Forward fill

Out[35]:                 AAPL       AMZN      GOOGL       MSFT
         Date
         2004-09-03  2.132708  38.139999  51.236237  17.673630
         2004-09-10  2.132708  38.139999  51.236237  17.673630
         2004-09-17  2.132708  38.139999  51.236237  17.673630
         2004-09-24  2.132708  38.139999  51.236237  17.673630
         2004-10-01  2.396127  40.860001  64.864868  17.900215

Downsampling data is one way of smoothing a time series.
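The ohlc aggregation mentioned above can be made concrete with a small sketch on invented prices (ten made-up daily values, not real quotes):

```python
import pandas as pd

# Ten made-up daily prices, all within January 2020
idx = pd.date_range("2020-01-01", periods=10, freq="D")
prices = pd.Series([10.0, 12.0, 11.0, 15.0, 14.0, 13.0, 16.0, 18.0, 17.0, 19.0],
                   index=idx)

# One row per calendar month with that month's open, high, low, and close
monthly = prices.resample("M").ohlc()
print(monthly)  # a single row for 2020-01-31: open=10, high=19, low=10, close=19
```

Each row of the result is exactly the four values a candlestick chart needs for that period.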
Calculating statistics over a rolling window is another way, as we will see next.

Rolling Windows

When you calculate time series statistics, you often want a rolling statistic such as the moving average. The moving average looks at a subset of the time series (let's say 25 days) and takes the mean from this subset before moving the window forward by one day. This will result in a new time series that is smoother and less prone to outliers. If you are into algorithmic trading, you may be looking at the intersection of the moving average with the stock price and take this (or some variation of it) as a trading signal. DataFrames have a rolling method, which accepts the number of observations as argument. You then chain it with the statistical method that you want to use—in the case of the moving average, it's the mean. By looking at Figure 6-6, you are easily able to compare the original time series with the smoothed moving average:

In [36]: # Plot the moving average for MSFT with data from 2019
         msft19 = msft.loc["2019", ["Adj Close"]].copy()
         # Add the 25 day moving average as a new column to the DataFrame
         msft19.loc[:, "25day average"] = msft19["Adj Close"].rolling(25).mean()
         msft19.plot()

Figure 6-6. Moving average plot

Instead of mean, you can use many other statistical measures including count, sum, median, min, max, std (standard deviation), or var (variance).

At this point, we have seen the most important functionality of pandas. It's equally important, though, to understand where pandas has its limits, even though they may still be far away right now.

Limitations with pandas

When your DataFrames start to get bigger, it's a good idea to know the upper limit of what a DataFrame can hold. Unlike Excel, where you have a hard limit of roughly one million rows and 16,000 columns per sheet, pandas only has a soft limit: all data must fit into the available memory of your machine.
If that's not the case, there might be some easy fixes: only load those columns from your dataset that you need, or delete intermediate results to free up some memory. If that doesn't help, there are quite a few projects that will feel familiar to pandas users but work with big data. One of the projects, Dask (https://dask.org), works on top of NumPy and pandas and allows you to work with big datasets by splitting them up into multiple pandas DataFrames and distributing the workload across multiple CPU cores or machines. Other big data projects that work with some sort of DataFrame are Modin (https://oreil.ly/Wd8gi), Koalas (https://oreil.ly/V13Be), Vaex (https://vaex.io), PySpark (https://oreil.ly/E7kmX), cuDF (https://oreil.ly/zaeWz), Ibis (https://oreil.ly/Gw4wn), and PyArrow (https://oreil.ly/DQQGD). We will briefly touch on Modin in the next chapter, but other than that, this is not something we are going to explore further in this book.

Conclusion

Time series analysis is the area where I feel Excel has fallen behind the most, so after reading this chapter, you probably understand why pandas has had such big success in finance, an industry that heavily relies on time series. We've seen how easy it is to work with time zones, resample time series, or produce correlation matrices—functionality that either isn't supported in Excel or requires cumbersome workarounds. Knowing how to use pandas doesn't mean you have to get rid of Excel, though, as the two worlds can play very nicely together: pandas DataFrames are a great way to transfer data from one world to the other, as we will see in the next part, which is about reading and writing Excel files in ways that bypass the Excel application entirely. This is very helpful as it means you can manipulate Excel files with Python on every operating system that Python supports, including Linux.
To start this journey, the next chapter will show you how pandas can be used to automate tedious manual processes like the aggregation of Excel files into summary reports.

PART III
Reading and Writing Excel Files Without Excel

CHAPTER 7
Excel File Manipulation with pandas

After six chapters of intense introductions to tools, Python, and pandas, I will give you a break and start this chapter with a practical case study that allows you to put your newly acquired skills to good use: with just ten lines of pandas code, you will consolidate dozens of Excel files into an Excel report, ready to be sent to your managers. After the case study, I'll give you a more in-depth introduction to the tools that pandas offers to work with Excel files: the read_excel function and the ExcelFile class for reading, and the to_excel method and the ExcelWriter class for writing Excel files. pandas does not rely on the Excel application to read and write Excel files, which means that all code samples in this chapter run everywhere Python runs, including Linux.

Case Study: Excel Reporting

This case study is inspired by a few real-world reporting projects I was involved in over the last few years. Even though the projects took place in completely different industries—including telecommunication, digital marketing, and finance—they were still remarkably similar: the starting point is usually a directory with Excel files that need to be processed into an Excel report—often on a monthly, weekly, or daily basis. In the companion repository, in the sales_data directory, you will find Excel files with fictitious sales transactions for a telecommunication provider selling different plans (Bronze, Silver, Gold) in a few stores throughout the United States. For every month, there are two files, one in the new subfolder for new contracts and one in the existing subfolder for existing customers.
As the reports come from different systems, they come in different formats: the new customers are delivered as xlsx files, while the existing customers arrive in the older xls format. Each of the files has up to 10,000 transactions, and our goal is to produce an Excel report that shows the total sales per store and month. To get started, let's have a look at the January.xlsx file from the new subfolder in Figure 7-1.

Figure 7-1. The first few rows of January.xlsx

The Excel files in the existing subfolder look practically the same, except that they are missing the status column and are stored in the legacy xls format. As a first step, let's read the new transactions from January with pandas' read_excel function:

In [1]: import pandas as pd

In [2]: df = pd.read_excel("sales_data/new/January.xlsx")
        df.info()

RangeIndex: 9493 entries, 0 to 9492
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   transaction_id    9493 non-null   object
 1   store             9493 non-null   object
 2   status            9493 non-null   object
 3   transaction_date  9493 non-null   datetime64[ns]
 4   plan              9493 non-null   object
 5   contract_type     9493 non-null   object
 6   amount            9493 non-null   float64
dtypes: datetime64[ns](1), float64(1), object(5)
memory usage: 519.3+ KB

The read_excel Function with Python 3.9
This is the same warning as in Chapter 5: if you are running pd.read_excel with Python 3.9 or above, make sure to use at least pandas 1.2 or you will get an error when reading xlsx files.

As you can see, pandas has properly recognized the data types of all columns, including the date format of transaction_date. This allows us to work with the data without further preparation. As this sample is deliberately simple, we can move on with creating a short script called sales_report_pandas.py as shown in Example 7-1. This script will read in all Excel files from both directories, aggregate the data, and write the summary table into a new Excel file.
Use VS Code to write the script yourself, or open it from the companion repository. For a refresher on how to create or open files in VS Code, have another look at Chapter 2. If you create it yourself, make sure to place it next to the sales_data folder—this will allow you to run the script without having to adjust any file paths.

Example 7-1. sales_report_pandas.py

from pathlib import Path
import pandas as pd

# Directory of this file
this_dir = Path(__file__).resolve().parent

# Read in all Excel files from all subfolders of sales_data
parts = []
for path in (this_dir / "sales_data").rglob("*.xls*"):
    print(f'Reading {path.name}')
    part = pd.read_excel(path, index_col="transaction_id")
    parts.append(part)

# Combine the DataFrames from each file into a single DataFrame
# pandas takes care of properly aligning the columns
df = pd.concat(parts)

# Pivot each store into a column and sum up all transactions per date
pivot = pd.pivot_table(df, index="transaction_date", columns="store",
                       values="amount", aggfunc="sum")

# Resample to end of month and assign an index name
summary = pivot.resample("M").sum()
summary.index.name = "Month"

# Write summary report to Excel file
summary.to_excel(this_dir / "sales_report_pandas.xlsx")

Up to this chapter, I was using strings to specify file paths. By using the Path class from the standard library's pathlib module instead, you get access to a powerful set of tools: path objects enable you to easily construct paths by concatenating individual parts via forward slashes, as it's done four lines below with this_dir / "sales_data". These paths work across platforms and allow you to apply filters like rglob as explained under the next point. __file__ resolves to the path of the source code file when you run it—using its parent will therefore give you the name of the directory of this file. The resolve method that we use before calling parent turns the path into an absolute path.
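The path handling and globbing used by the script can be tried in isolation. This sketch builds a throwaway directory (all file names are invented for the demo) and shows how rglob with the enhanced [!~$]*.xls* pattern picks up Excel files recursively while skipping Excel's temporary lock files:

```python
from pathlib import Path
import tempfile

# Build a throwaway directory tree; the file names are made up for the demo
root = Path(tempfile.mkdtemp())
(root / "new").mkdir()
for name in ["January.xls", "new/February.xlsx", "notes.txt", "~$March.xlsx"]:
    (root / name).touch()  # concatenating parts via the / operator

# Recursive globbing: *.xls* matches both .xls and .xlsx in all subfolders;
# the [!~$] class requires a first character that is neither ~ nor $,
# which filters out Excel's temporary files (their names start with ~$)
matches = sorted(p.name for p in root.rglob("[!~$]*.xls*"))
print(matches)  # ['February.xlsx', 'January.xls']
```

Note that notes.txt and the temporary ~$March.xlsx are both ignored, while the xls and xlsx files are found across subfolders.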
If you would run this from a Jupyter notebook instead, you would have to replace this line with this_dir = Path(".").resolve(), with the dot representing the current directory. In most cases, functions and classes that accept a path in the form of a string also accept a path object.

The easiest way to read in all Excel files recursively from within a certain directory is to use the rglob method of the path object. glob is short for globbing, which refers to pathname expansion using wildcards. The ? wildcard represents exactly one character, while * stands for any number of characters (including zero). The r in rglob means recursive globbing, i.e., it will look for matching files across all subdirectories—accordingly, glob would ignore subdirectories. Using *.xls* as the globbing expression makes sure that the old and new Excel files are found, as it matches both .xls and .xlsx. It's usually a good idea to slightly enhance the expression like this: [!~$]*.xls*. This ignores temporary Excel files (their file names start with ~$). For more background on how to use globbing in Python, see the Python docs (https://oreil.ly/fY0qG).

Run the script, for example, by clicking the Run File button at the top right of VS Code. The script will take a moment to complete and once done, the Excel workbook sales_report_pandas.xlsx will show up in the same directory as the script. The content of Sheet1 should look like Figure 7-2. That's quite an impressive result for only ten lines of code—even if you will need to adjust the width of the first column to be able to see the dates!

Figure 7-2. sales_report_pandas.xlsx (as-is, without adjusting any column width)

For simple cases like this one, pandas offers a really easy solution to work with Excel files. However, we can do much better—after all, a title, some formatting, and a chart wouldn't hurt.
Preface

Microsoft is running a feedback forum for Excel on UserVoice (https://oreil.ly/y1XwU) where everybody can submit a new idea for others to vote on. The top voted feature request is "Python as an Excel scripting language," and it has roughly twice as many votes as the second most voted feature request.
Though nothing really happened since the idea was added in 2015, Excel users were fueled with new hope at the end of 2020 when Guido van Rossum, the creator of Python, tweeted (https://oreil.ly/N1_7N) that his "retirement was boring" and he would join Microsoft. Whether his move has any influence on the integration of Excel and Python, I don't know. I do know, however, what makes this combination so compelling and how you can start using Excel and Python together—today. And this is, in a nutshell, what this book is about.

The main driving force behind the Python for Excel story is the fact that we are living in a world of data. Nowadays, huge datasets are available to everybody and about everything. Often, these datasets are so big that they don't fit into a spreadsheet anymore. A few years ago, this may have been referred to as big data, but nowadays, a dataset of a few million rows is really nothing special. Excel has evolved to cope with that trend: it introduced Power Query to load and clean datasets that don't fit into a spreadsheet and Power Pivot, an add-in to perform data analysis on these datasets and present the results. Power Query is based on the Power Query M formula language (M), while Power Pivot defines formulas by using Data Analysis Expressions (DAX). If you also want to automate a few things in your Excel file, then you would use Excel's built-in automation language, Visual Basic for Applications (VBA). That is, for something fairly simple, you can end up using VBA, M, and DAX. One issue with this is that all these languages only serve you in the Microsoft world, most prominently in Excel and Power BI (I will introduce Power BI briefly in Chapter 1). Python, on the other hand, is a general-purpose programming language that has become one of the most popular choices amongst analysts and data scientists.
If you use Python with Excel, you are able to use a programming language that is good at all aspects of the story, whether that's automating Excel, accessing and preparing datasets, or performing data analysis and visualization tasks. Most importantly, you can reuse your Python skills outside of Excel: if you need to scale up your computing power, you could easily move your quantitative model, simulation, or machine learning application.

A title, some formatting (including column width and a consistent number of decimals), and a chart wouldn't hurt—that's exactly what we will take care of in the next chapter by directly using the writer libraries that pandas uses under the hood. Before we get there, however, let's have a more detailed look at how we can read and write Excel files with pandas.

Reading and Writing Excel Files with pandas

The case study was using read_excel and to_excel with their default arguments to keep things simple. In this section, I will show you the most commonly used arguments and options when reading and writing Excel files with pandas. We'll start with the read_excel function and the ExcelFile class before looking at the to_excel method and the ExcelWriter class. Along the way, I'll also introduce Python's with statement.

The read_excel Function and ExcelFile Class

The case study used Excel workbooks where the data was conveniently in cell A1 of the first sheet. In reality, your Excel files are probably not so well organized. In this case, pandas offers parameters to fine-tune the reading process. For the next few samples, we're going to use the stores.xlsx file that you will find in the xl folder of the companion repository. The first sheet is shown in Figure 7-3.

Figure 7-3. The first sheet of stores.xlsx

By using the parameters sheet_name, skiprows, and usecols, we can tell pandas about the cell range that we want to read in.
As usual, it's a good idea to have a look at the data types of the returned DataFrame by running the info method:

In [3]: df = pd.read_excel("xl/stores.xlsx", sheet_name="2019",
                           skiprows=1, usecols="B:F")
        df

Out[3]:            Store  Employees    Manager      Since Flagship
        0       New York         10      Sarah 2018-07-20    False
        1  San Francisco         12     Neriah 2019-11-02  MISSING
        2        Chicago          4    Katelin 2020-01-31      NaN
        3         Boston          5  Georgiana 2017-04-01     True
        4  Washington DC          3       Evan        NaT    False
        5      Las Vegas         11       Paul 2020-01-06    False

In [4]: df.info()

RangeIndex: 6 entries, 0 to 5
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Store      6 non-null      object
 1   Employees  6 non-null      int64
 2   Manager    6 non-null      object
 3   Since      5 non-null      datetime64[ns]
 4   Flagship   5 non-null      object
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 368.0+ bytes

Everything looks good except for the Flagship column—its data type should be bool rather than object. To fix this, we can provide a converter function that deals with the offending cells in that column (instead of writing the fix_missing function, we could have also provided a lambda expression):

In [5]: def fix_missing(x):
            return False if x in ["", "MISSING"] else x

In [6]: df = pd.read_excel("xl/stores.xlsx", sheet_name="2019",
                           skiprows=1, usecols="B:F",
                           converters={"Flagship": fix_missing})
        df

Out[6]:            Store  Employees    Manager      Since  Flagship
        0       New York         10      Sarah 2018-07-20     False
        1  San Francisco         12     Neriah 2019-11-02     False
        2        Chicago          4    Katelin 2020-01-31     False
        3         Boston          5  Georgiana 2017-04-01      True
        4  Washington DC          3       Evan        NaT     False
        5      Las Vegas         11       Paul 2020-01-06     False

In [7]: # The Flagship column now has Dtype "bool"
        df.info()

RangeIndex: 6 entries, 0 to 5
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Store      6 non-null      object
 1   Employees  6 non-null      int64
 2   Manager    6 non-null      object
 3   Since      5 non-null      datetime64[ns]
 4   Flagship   6 non-null      bool
dtypes: bool(1), datetime64[ns](1), int64(1), object(2)
memory usage: 326.0+ bytes

The read_excel function also accepts a list of sheet names. In this case, it returns a dictionary with the DataFrame as value and the name of the sheet as key. To read in all sheets, you would need to provide sheet_name=None. Also, note the slight variation of how I am using usecols by providing the column names of the table:

In [8]: sheets = pd.read_excel("xl/stores.xlsx", sheet_name=["2019", "2020"],
                               skiprows=1, usecols=["Store", "Employees"])
        sheets["2019"].head(2)

Out[8]:            Store  Employees
        0       New York         10
        1  San Francisco         12

If the source file doesn't have column headers, set header=None and provide them via names. Note that sheet_name also accepts sheet indices:

In [9]: df = pd.read_excel("xl/stores.xlsx", sheet_name=0,
                           skiprows=2, skipfooter=3,
                           usecols="B:C,F", header=None,
                           names=["Branch", "Employee_Count", "Is_Flagship"])
        df

Out[9]:           Branch  Employee_Count Is_Flagship
        0       New York              10       False
        1  San Francisco              12     MISSING
        2        Chicago               4         NaN

To handle NaN values, use a combination of na_values and keep_default_na. The next sample tells pandas to only interpret cells with the word MISSING as NaN and nothing else:

In [10]: df = pd.read_excel("xl/stores.xlsx", sheet_name="2019",
                            skiprows=1, usecols="B,C,F", skipfooter=2,
                            na_values="MISSING", keep_default_na=False)
         df

Out[10]:            Store  Employees Flagship
         0       New York         10    False
         1  San Francisco         12      NaN
         2        Chicago          4
         3         Boston          5     True

pandas offers an alternative way to read Excel files by using the ExcelFile class. This mostly makes a difference if you want to read in multiple sheets from a file in the legacy xls format: in this case, using ExcelFile will be faster, as it prevents pandas from reading in the whole file multiple times.
ExcelFile can be used as a context manager (see sidebar) so the file is properly closed again.

Context Managers and the with Statement

First of all, the with statement in Python doesn't have anything to do with the With statement in VBA: in VBA, it is used to run a series of statements on the same object, while in Python, it is used to manage resources like files or database connections. If you want to load the latest sales data to be able to analyze it, you may have to open a file or establish a connection to a database. After you're done reading the data, it's best practice to close the file or connection again as soon as possible. Otherwise, you may run into situations where you can't open another file or can't establish another connection to the database—file handlers and database connections are limited resources. Opening and closing a text file manually works like this (w stands for opening the file in write mode, which replaces the file if it already exists):

In [11]: f = open("output.txt", "w")
         f.write("Some text")
         f.close()

Running this code will create a file called output.txt in the same directory as the notebook you are running it from and write "Some text" to it. To read a file, you would use r instead of w, and to append to the end of the file, use a. Since files can also be manipulated from outside of your program, such an operation could fail. You could handle this by using the try/except mechanism that I will introduce in Chapter 11. However, since this is such a common operation, Python provides the with statement to make things easier:

In [12]: with open("output.txt", "w") as f:
             f.write("Some text")

When code execution leaves the body of the with statement, the file is automatically closed, whether or not there is an exception happening. This guarantees that the resources are properly cleaned up.
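Under the hood, the with block behaves much like a hand-written try/finally construct. Here is a minimal sketch of that equivalence, using a throwaway file path so it doesn't clutter your working directory:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "output.txt")

# Roughly what the with statement does for a file, spelled out by hand
f = open(path, "w")
try:
    f.write("Some text")
finally:
    f.close()  # runs whether or not write() raised an exception

# The with statement packs the same guarantee into two lines
with open(path, "w") as f:
    f.write("Some text")

print(f.closed)  # True: the file was closed on leaving the block
```

Either way, the file handle is released even if an error occurs inside the block, which is exactly the guarantee you want for limited resources.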
Objects that support the with statement are called context managers; this includes the ExcelFile and ExcelWriter objects in this chapter, as well as the database connection objects that we will look at in Chapter 11.

Let's see the ExcelFile class in action:

In [13]: with pd.ExcelFile("xl/stores.xls") as f:
             df1 = pd.read_excel(f, "2019", skiprows=1, usecols="B:F", nrows=2)
             df2 = pd.read_excel(f, "2020", skiprows=1, usecols="B:F", nrows=2)
         df1

Out[13]:            Store  Employees Manager      Since Flagship
         0       New York         10   Sarah 2018-07-20    False
         1  San Francisco         12  Neriah 2019-11-02  MISSING

ExcelFile also gives you access to the names of all sheets:

In [14]: stores = pd.ExcelFile("xl/stores.xlsx")
         stores.sheet_names

Out[14]: ['2019', '2020', '2019-2020']

Finally, pandas allows you to read Excel files from a URL, similar to how we did it with CSV files in Chapter 5. Let's read it directly from the companion repo:

In [15]: url = ("https://raw.githubusercontent.com/fzumstein/"
                "python-for-excel/1st-edition/xl/stores.xlsx")
         pd.read_excel(url, skiprows=1, usecols="B:E", nrows=2)

Out[15]:            Store  Employees Manager      Since
         0       New York         10   Sarah 2018-07-20
         1  San Francisco         12  Neriah 2019-11-02

Reading xlsb Files via pandas
If you use pandas with a version below 1.3, reading xlsb files requires you to explicitly specify the engine in the read_excel function or ExcelFile class:

pd.read_excel("xl/stores.xlsb", engine="pyxlsb")

This requires the pyxlsb package to be installed, as it isn't part of Anaconda—we'll get to that as well as to the other engines in the next chapter.

To summarize, Table 7-1 shows you the most commonly used read_excel parameters. You will find the complete list in the official docs (https://oreil.ly/v8Yes).

Table 7-1. Selected parameters for read_excel

sheet_name
    Instead of providing a sheet name, you could also provide the index of the sheet (zero-based), e.g., sheet_name=0. If you set sheet_name=None, pandas will read the whole workbook and return a dictionary in the form of {"sheetname": df}. To read a selection of sheets, provide a list with sheet names or indices.

skiprows
    This allows you to skip over the indicated number of rows.

usecols
    If the Excel file includes the names of the column headers, provide them in a list to select the columns, e.g., ["Store", "Employees"]. Alternatively, it can also be a list of column indices, e.g., [1, 2], or a string (not a list!) of Excel column names, including ranges, e.g., "B:D,G". You can also provide a function: as an example, to only include the columns that start with Manager, use usecols=lambda x: x.startswith("Manager").

nrows
    Number of rows you want to read.

index_col
    Indicates which column should be the index; accepts a column name or an index, e.g., index_col=0. If you provide a list with multiple columns, a hierarchical index will be created.

header
    If you set header=None, the default integer headers are assigned, except if you provide the desired names via the names parameter. If you provide a list of indices, hierarchical column headers will be created.

names
    Provide the desired names of your columns as a list.

na_values
    pandas interprets the following cell values as NaN by default (I introduced NaN in Chapter 5): empty cells, #NA, NA, null, #N/A, N/A, NaN, n/a, -NaN, 1.#IND, nan, #N/A N/A, -1.#QNAN, -nan, NULL, -1.#IND, <NA>, 1.#QNAN. If you'd like to add one or more values to that list, provide them via na_values.

keep_default_na
    If you'd like to ignore the default values that pandas interprets as NaN, set keep_default_na=False.

convert_float
    Excel stores all numbers internally as floats and by default, pandas transforms numbers without meaningful decimals to integers. If you want to change that behavior, set convert_float=False (this may be a bit faster).

converters
    Allows you to provide a function per column to convert its values. For example, to make the text in a certain column uppercase, use: converters={"column_name": lambda x: x.upper()}

So much for reading Excel files with pandas—let's now switch sides and learn about writing Excel files in the next section!

The to_excel Method and ExcelWriter Class

The easiest way to write an Excel file with pandas is to use a DataFrame's to_excel method. It allows you to specify to which cell of which sheet you want to write the DataFrame. You can also decide whether or not to include the column headers and the index of the DataFrame, and how to treat data types like np.nan and np.inf that don't have an equivalent representation in Excel. Let's start by creating a DataFrame with different data types and use its to_excel method:

In [16]: import numpy as np
         import datetime as dt

In [17]: data = [[dt.datetime(2020, 1, 1, 10, 13), 2.222, 1, True],
                 [dt.datetime(2020, 1, 2), np.nan, 2, False],
                 [dt.datetime(2020, 1, 2), np.inf, 3, True]]
         df = pd.DataFrame(data=data,
                           columns=["Dates", "Floats", "Integers", "Booleans"])
         df.index.name = "index"
         df

Out[17]:                      Dates  Floats  Integers  Booleans
         index
         0      2020-01-01 10:13:00   2.222         1      True
         1      2020-01-02 00:00:00     NaN         2     False
         2      2020-01-02 00:00:00     inf         3      True

In [18]: df.to_excel("written_with_pandas.xlsx", sheet_name="Output",
                     startrow=1, startcol=1, index=True, header=True,
                     na_rep="<NA>", inf_rep="<INF>")

Running the to_excel command will create the Excel file as shown in Figure 7-4 (you will need to make column C wider to see the dates properly):

Figure 7-4. written_with_pandas.xlsx

If you want to write multiple DataFrames to the same or different sheets, you will need to use the ExcelWriter class.
The following sample writes the same DataFrame to two different locations on Sheet1 and one more time to Sheet2:

In [19]: with pd.ExcelWriter("written_with_pandas2.xlsx") as writer:
             df.to_excel(writer, sheet_name="Sheet1", startrow=1, startcol=1)
             df.to_excel(writer, sheet_name="Sheet1", startrow=10, startcol=1)
             df.to_excel(writer, sheet_name="Sheet2")

Since we're using the ExcelWriter class as a context manager, the file is automatically written to disk when it exits the context manager, i.e., when the indentation stops. Otherwise, you will have to call writer.save() explicitly. For a summary of the most commonly used parameters that to_excel accepts, have a look at Table 7-2. You will find the full list of parameters in the official docs (https://oreil.ly/ESKAG).

Table 7-2. Selected parameters for to_excel

sheet_name
    Name of the sheet to write to.

startrow and startcol
    startrow is the first row where the DataFrame will be written to and startcol is the first column. This uses zero-based indexing, so if you want to write your DataFrame into cell B3, use startrow=2 and startcol=1.

index and header
    If you want to hide the index and/or header, set them to index=False and header=False, respectively.

na_rep and inf_rep
    By default, np.nan will be converted to an empty cell, while np.inf, NumPy's representation of infinity, will be converted to the string inf. Providing values allows you to change this behavior.

freeze_panes
    Freeze the first couple of rows and columns by supplying a tuple: for example, (2, 1) will freeze the first two rows and the first column.

As you can see, reading and writing simple Excel files with pandas works well.
There are limitations, though—let's see which ones!

Limitations When Using pandas with Excel Files

Using the pandas interface to read and write Excel files works great for simple cases, but there are limits:

• When writing DataFrames to files, you can't include a title or a chart.
• There is no way to change the default format of the header and index in Excel.
• When reading files, pandas automatically transforms cells with errors like #REF! or #NUM! into NaN, making it impossible to search for specific errors in your spreadsheets.
• Working with big Excel files may require extra settings that are easier to control by using the reader and writer packages directly, as we will see in the next chapter.

Conclusion

The nice thing about pandas is that it offers a consistent interface to work with all supported Excel file formats, whether that's xls, xlsx, xlsm, or xlsb. This made it easy for us to read a directory of Excel files, aggregate the data, and dump the summary into an Excel report—in only ten lines of code.

pandas, however, doesn't do the heavy lifting itself: under the hood, it selects a reader or writer package to do the job. In the next chapter, I will show you which reader and writer packages pandas uses and how you use them directly or in combination with pandas. This will allow us to work around the limitations we saw in the previous section.

CHAPTER 8
Excel File Manipulation with Reader and Writer Packages

This chapter introduces you to OpenPyXL, XlsxWriter, pyxlsb, xlrd, and xlwt: these are the packages that can read and write Excel files and are used by pandas under the hood when you call the read_excel or to_excel functions. Using the reader and writer packages directly allows you to create more complex Excel reports as well as fine-tune the reading process.
Also, should you ever work on a project where you only need to read and write Excel files without the need for the rest of the pandas functionality, installing the full NumPy/pandas stack would probably be overkill. We'll start this chapter by learning when to use which package and how their syntax works before looking at a few advanced topics, including how to work with big Excel files and how to combine pandas with the reader and writer packages to improve the styling of DataFrames. To conclude, we will pick up the case study from the beginning of the last chapter again and enhance the Excel report by formatting the table and adding a chart. Like the last chapter, this chapter does not require an installation of Excel, which means that all code samples run on Windows, macOS, and Linux.

The Reader and Writer Packages

The reader and writer landscape can be a bit overwhelming: we are going to look at no less than six packages in this section, as almost every Excel file type requires a different package. The fact that each package uses a different syntax that often deviates substantially from the original Excel object model doesn't make it easier—I'll say more about the Excel object model in the next chapter. This means that you will likely have to look up a lot of commands, even if you are a seasoned VBA developer. This section starts with an overview of when you need which package before it introduces a helper module that makes working with these packages a little easier. After that, it presents each of the packages in a cookbook style, where you can look up how the most commonly used commands work.

When to Use Which Package

This section introduces the following six packages to read, write, and edit Excel files:

• OpenPyXL
• XlsxWriter
• pyxlsb
• xlrd
• xlwt
• xlutils

To understand which package can do what, have a look at Table 8-1. For example, to read the xlsx file format, you will have to use the OpenPyXL package:

Table 8-1. When to use which package

Excel File Format   Read       Write                  Edit
xlsx                OpenPyXL   OpenPyXL, XlsxWriter   OpenPyXL
xlsm                OpenPyXL   OpenPyXL, XlsxWriter   OpenPyXL
xltx, xltm          OpenPyXL   OpenPyXL               OpenPyXL
xlsb                pyxlsb     -                      -
xls, xlt            xlrd       xlwt                   xlutils

If you want to write xlsx or xlsm files, you need to decide between OpenPyXL and XlsxWriter. Both packages cover similar functionality, but each package may have a few unique features that the other one doesn't have. As both libraries are actively being developed, this is changing over time. Here is a high-level overview of where they differ:

• OpenPyXL can read, write, and edit, while XlsxWriter can only write.
• OpenPyXL makes it easier to produce Excel files with VBA macros.
• XlsxWriter is better documented.
• XlsxWriter tends to be faster than OpenPyXL, but depending on the size of the workbook you're writing, the differences may not be significant.

Where Is xlwings?

If you are wondering where xlwings is in Table 8-1, then the answer is nowhere or everywhere, depending on your use case: unlike any of the packages in this chapter, xlwings depends on the Excel application, which often isn't available, for example, if you need to run your scripts on Linux. If, on the other hand, you are OK with running your scripts on Windows or macOS where you have access to an installation of Excel, xlwings can indeed be used as an alternative to all packages in this chapter. Since Excel dependency is such a fundamental difference between xlwings and all the other Excel packages, I am introducing xlwings in the next chapter, which starts Part IV of this book.

pandas uses the writer package it can find, and if you have both OpenPyXL and XlsxWriter installed, XlsxWriter is the default.
If you want to choose which package pandas should use, specify the engine parameter in the read_excel or to_excel functions or the ExcelFile and ExcelWriter classes, respectively. The engine is the package name in lowercase, so to write a file with OpenPyXL instead of XlsxWriter, run the following:

df.to_excel("filename.xlsx", engine="openpyxl")

Once you know which package you need, there is a second challenge waiting for you: most of these packages require you to write quite a bit of code to read or write a range of cells, and each package uses a different syntax. To make your life easier, I created a helper module that I'll introduce next.

The excel.py Module

I have created the excel.py module to make your life easier when using the reader and writer packages, as it takes care of the following issues:

Package switching
    Having to switch the reader or writer package is a relatively common scenario. For example, Excel files tend to grow in size over time, which many users fight by switching the file format from xlsx to xlsb, as this can substantially reduce the file size. In that case, you will have to switch away from OpenPyXL to pyxlsb. This forces you to rewrite your OpenPyXL code to reflect pyxlsb's syntax.

Data type conversion
    This is connected to the previous point: when switching packages, you don't just have to adjust the syntax of your code, but you also need to watch out for different data types that these packages return for the same cell content. For example, OpenPyXL returns None for empty cells, while xlrd returns an empty string.

Cell looping
    The reader and writer packages are low-level packages: this means that they lack convenience functions that would allow you to tackle common tasks easily.
For example, most of the packages require you to loop through every single cell that you are going to read or write.

You will find the excel.py module in the companion repository, and we will use it in the upcoming sections, but as a preview, here is the syntax to read and write values:

import excel

values = excel.read(sheet_object, first_cell="A1", last_cell=None)
excel.write(sheet_object, values, first_cell="A1")

The read function accepts a sheet object from one of the following packages: xlrd, OpenPyXL, or pyxlsb. It also accepts the optional arguments first_cell and last_cell. They can be provided in either A1 notation or as a row-column tuple with Excel's one-based indices: (1, 1). The default value for first_cell is A1, whereas the default value for last_cell is the bottom-right corner of the used range. Hence, if you only provide the sheet object, it will read the whole sheet. The write function works similarly: it expects a sheet object from xlwt, OpenPyXL, or XlsxWriter along with the values as a nested list and an optional first_cell, which marks the upper-left corner of where the nested list will be written. The excel.py module also harmonizes the data type conversion as shown in Table 8-2.

Table 8-2. Data type conversion

Excel representation       Python data type
Empty cell                 None
Cell with a date format    datetime.datetime (except for pyxlsb)
Cell with a boolean        bool
Cell with an error         str (the error message)
String                     str
Float                      float or int

Equipped with the excel.py module, we're now ready to dive into the packages: the next four sections are about OpenPyXL, XlsxWriter, pyxlsb, and xlrd/xlwt/xlutils. They follow a cookbook style that allows you to get started quickly with each package. Instead of reading through it sequentially, I recommend that you pick the package you need based on Table 8-1, then jump directly to the corresponding section.

The with Statement

We'll be using the with statement on various occasions in this chapter. If you need a refresher, have a look at the sidebar "Context Managers and the with Statement" on page 150 in Chapter 7.

OpenPyXL

OpenPyXL is the only package in this section that can both read and write Excel files. You can even use it to edit Excel files—albeit only simple ones. Let's start by looking at how reading works!

Reading with OpenPyXL

The following sample code shows you how to perform common tasks when you use OpenPyXL for reading Excel files. To get the cell values, you need to open the workbook with data_only=True. The default is False, which would return the formulas of the cells instead:

In [1]: import pandas as pd
        import openpyxl
        import excel
        import datetime as dt

In [2]: # Open the workbook to read cell values.
        # The file is automatically closed again after loading the data.
        book = openpyxl.load_workbook("xl/stores.xlsx", data_only=True)

In [3]: # Get a worksheet object by name or index (0-based)
        sheet = book["2019"]
        sheet = book.worksheets[0]

In [4]: # Get a list with all sheet names
        book.sheetnames

Out[4]: ['2019', '2020', '2019-2020']

In [5]: # Loop through the sheet objects.
        # Instead of "name", openpyxl uses "title".
        for i in book.worksheets:
            print(i.title)

2019
2020
2019-2020

In [6]: # Getting the dimensions,
        # i.e., the used range of the sheet
        sheet.max_row, sheet.max_column

Out[6]: (8, 6)

In [7]: # Read the value of a single cell
        # using "A1" notation and using cell indices (1-based)
        sheet["B6"].value
        sheet.cell(row=6, column=2).value

Out[7]: 'Boston'

In [8]: # Read in a range of cell values by using our excel module
        data = excel.read(book["2019"], (2, 2), (8, 6))
        data[:2]  # Print the first two rows

Out[8]: [['Store', 'Employees', 'Manager', 'Since', 'Flagship'],
         ['New York', 10, 'Sarah', datetime.datetime(2018, 7, 20, 0, 0), False]]

Writing with OpenPyXL

OpenPyXL builds the Excel file in memory and writes out the file once you call the save method. The following code produces the file as shown in Figure 8-1:

In [9]: import openpyxl
        from openpyxl.drawing.image import Image
        from openpyxl.chart import BarChart, Reference
        from openpyxl.styles import Font, colors
        from openpyxl.styles.borders import Border, Side
        from openpyxl.styles.alignment import Alignment
        from openpyxl.styles.fills import PatternFill
        import excel

In [10]: # Instantiate a workbook
         book = openpyxl.Workbook()

         # Get the first sheet and give it a name
         sheet = book.active
         sheet.title = "Sheet1"

         # Writing individual cells using A1 notation
         # and cell indices (1-based)
         sheet["A1"].value = "Hello 1"
         sheet.cell(row=2, column=1, value="Hello 2")

         # Formatting: fill color, alignment, border and font
         font_format = Font(color="FF0000", bold=True)
         thin = Side(border_style="thin", color="FF0000")
         sheet["A3"].value = "Hello 3"
         sheet["A3"].font = font_format
         sheet["A3"].border = Border(top=thin, left=thin,
                                     right=thin, bottom=thin)
         sheet["A3"].alignment = Alignment(horizontal="center")
         sheet["A3"].fill = PatternFill(fgColor="FFFF00", fill_type="solid")

         # Number formatting (using Excel's formatting strings)
         sheet["A4"].value = 3.3333
         sheet["A4"].number_format = "0.00"

         # Date formatting (using Excel's formatting strings)
         sheet["A5"].value = dt.date(2016, 10, 13)
         sheet["A5"].number_format = "mm/dd/yy"

         # Formula: you must use the English name of the formula
         # with commas as delimiters
         sheet["A6"].value = "=SUM(A4, 2)"

         # Image
         sheet.add_image(Image("images/python.png"), "C1")

         # Two-dimensional list (we're using our excel module)
         data = [[None, "North", "South"],
                 ["Last Year", 2, 5],
                 ["This Year", 3, 6]]
         excel.write(sheet, data, "A10")

         # Chart
         chart = BarChart()
         chart.type = "col"
         chart.title = "Sales Per Region"
         chart.x_axis.title = "Regions"
         chart.y_axis.title = "Sales"
         chart_data = Reference(sheet, min_row=11, min_col=1,
                                max_row=12, max_col=3)
         chart_categories = Reference(sheet, min_row=10, min_col=2,
                                      max_row=10, max_col=3)
         # from_rows interprets the data in the same way
         # as if you would add a chart manually in Excel
         chart.add_data(chart_data, titles_from_data=True, from_rows=True)
         chart.set_categories(chart_categories)
         sheet.add_chart(chart, "A15")

         # Saving the workbook creates the file on disk
         book.save("openpyxl.xlsx")

If you want to write an Excel template file, you'll need to set the template attribute to True before saving it:

In [11]: book = openpyxl.Workbook()
         sheet = book.active
         sheet["A1"].value = "This is a template"
         book.template = True
         book.save("template.xltx")

As you can see in the code, OpenPyXL is setting colors by providing a string like FF0000. This value is made up of three hex values (FF, 00, and 00) that correspond to the red/green/blue values of the desired color. Hex stands for hexadecimal and represents numbers using a base of sixteen instead of the base of ten that is used by our standard decimal system.

Finding the Hex Value of a Color

To find the desired hex value of a color in Excel, click on the paint dropdown that you would use to change the fill color of a cell, then select More Colors. Now select your color and read off its hex value from the menu.

Figure 8-1. The file written by OpenPyXL (openpyxl.xlsx)

Editing with OpenPyXL

There is no reader/writer package that can truly edit Excel files: in reality, OpenPyXL reads the file with everything it understands, then writes the file again from scratch—including any changes you make in between. This can be very powerful for simple Excel files that contain mainly formatted cells with data and formulas, but it's limited when you have charts and other more advanced content in your spreadsheet, as OpenPyXL will either change them or drop them altogether. For example, as of v3.0.5, OpenPyXL will rename charts and drop their title. Here is a simple editing example:

In [12]: # Read the stores.xlsx file, change a cell
         # and store it under a new location/name.
         book = openpyxl.load_workbook("xl/stores.xlsx")
         book["2019"]["A1"].value = "modified"
         book.save("stores_edited.xlsx")

If you want to write an xlsm file, OpenPyXL has to work off an existing file that you need to load with the keep_vba parameter set to True:

In [13]: book = openpyxl.load_workbook("xl/macro.xlsm", keep_vba=True)
         book["Sheet1"]["A1"].value = "Click the button!"
         book.save("macro_openpyxl.xlsm")

The button in the example file is calling a macro that shows a message box. OpenPyXL covers a lot more functionality than I am able to cover in this section; it is therefore a good idea to have a look at the official docs. We will also see more functionality at the end of this chapter when we pick up the case study from the previous chapter again.

XlsxWriter

As the name suggests, XlsxWriter can only write Excel files. The following code produces the same workbook as we previously produced with OpenPyXL, which is shown in Figure 8-1.
Note that XlsxWriter uses zero-based cell indices, while OpenPyXL uses one-based cell indices—make sure to take this into account if you switch between packages:

In [14]: import datetime as dt
         import xlsxwriter
         import excel

In [15]: # Instantiate a workbook
         book = xlsxwriter.Workbook("xlxswriter.xlsx")

         # Add a sheet and give it a name
         sheet = book.add_worksheet("Sheet1")

         # Writing individual cells using A1 notation
         # and cell indices (0-based)
         sheet.write("A1", "Hello 1")
         sheet.write(1, 0, "Hello 2")

         # Formatting: fill color, alignment, border and font
         formatting = book.add_format({"font_color": "#FF0000",
                                       "bg_color": "#FFFF00",
                                       "bold": True, "align": "center",
                                       "border": 1, "border_color": "#FF0000"})
         sheet.write("A3", "Hello 3", formatting)

         # Number formatting (using Excel's formatting strings)
         number_format = book.add_format({"num_format": "0.00"})
         sheet.write("A4", 3.3333, number_format)

         # Date formatting (using Excel's formatting strings)
         date_format = book.add_format({"num_format": "mm/dd/yy"})
         sheet.write("A5", dt.date(2016, 10, 13), date_format)

         # Formula: you must use the English name of the formula
         # with commas as delimiters
         sheet.write("A6", "=SUM(A4, 2)")

         # Image
         sheet.insert_image(0, 2, "images/python.png")

         # Two-dimensional list (we're using our excel module)
         data = [[None, "North", "South"],
                 ["Last Year", 2, 5],
                 ["This Year", 3, 6]]
         excel.write(sheet, data, "A10")

         # Chart: see the file "sales_report_xlsxwriter.py" in the
         # companion repo to see how you can work with indices
         # instead of cell addresses
         chart = book.add_chart({"type": "column"})
         chart.set_title({"name": "Sales per Region"})
         chart.add_series({"name": "=Sheet1!A11",
                           "categories": "=Sheet1!B10:C10",
                           "values": "=Sheet1!B11:C11"})
         chart.add_series({"name": "=Sheet1!A12",
                           "categories": "=Sheet1!B10:C10",
                           "values": "=Sheet1!B12:C12"})
         chart.set_x_axis({"name": "Regions"})
         chart.set_y_axis({"name": "Sales"})
         sheet.insert_chart("A15", chart)

         # Closing the workbook creates the file on disk
         book.close()

In comparison to OpenPyXL, XlsxWriter has to take a more complicated approach to writing xlsm files, as it is a pure writer package. First, you need to extract the macro code from an existing Excel file on the Anaconda Prompt (the example uses the macro.xlsm file, which you'll find in the xl folder of the companion repo):

Windows
    Start by changing into the xl directory, then find the path to vba_extract.py, a script that comes with XlsxWriter:

    (base)> cd C:\Users\username\python-for-excel\xl
    (base)> where vba_extract.py
    C:\Users\username\Anaconda3\Scripts\vba_extract.py

    Then use this path in the following command:

    (base)> python C:\...\Anaconda3\Scripts\vba_extract.py macro.xlsm

macOS
    On macOS, the command is available as an executable script and can be run like this:

    (base)> cd /Users/username/python-for-excel/xl
    (base)> vba_extract.py macro.xlsm

This will save the file vbaProject.bin in the directory where you are running the command. I have also included the extracted file in the xl folder of the companion repo. We will use it in the following sample to write a workbook with a macro button:

In [16]: book = xlsxwriter.Workbook("macro_xlxswriter.xlsm")
         sheet = book.add_worksheet("Sheet1")
         sheet.write("A1", "Click the button!")
         book.add_vba_project("xl/vbaProject.bin")
         sheet.insert_button("A3", {"macro": "Hello", "caption": "Button 1",
                                    "width": 130, "height": 35})
         book.close()

pyxlsb

Compared to the other reader libraries, pyxlsb offers less functionality, but it's your only option when it comes to reading Excel files in the binary xlsb format. pyxlsb is not part of Anaconda, so you will need to install it if you haven't already done this. It is currently not available via Conda either, so use pip to install it:

(base)> pip install pyxlsb

You read sheets and cell values as follows:

In [17]: import pyxlsb
         import excel

In [18]: # Loop through sheets. With pyxlsb, the workbook
         # and sheet objects can be used as context managers.
         # book.sheets returns a list of sheet names, not objects!
         # To get a sheet object, use get_sheet() instead.
         with pyxlsb.open_workbook("xl/stores.xlsb") as book:
             for sheet_name in book.sheets:
                 with book.get_sheet(sheet_name) as sheet:
                     dim = sheet.dimension
                     print(f"Sheet '{sheet_name}' has "
                           f"{dim.h} rows and {dim.w} cols")

Sheet '2019' has 7 rows and 5 cols
Sheet '2020' has 7 rows and 5 cols
Sheet '2019-2020' has 20 rows and 5 cols

In [19]: # Read in the values of a range of cells by using our excel module.
         # Instead of "2019", you could also use its index (1-based).
         with pyxlsb.open_workbook("xl/stores.xlsb") as book:
             with book.get_sheet("2019") as sheet:
                 data = excel.read(sheet, "B2")
         data[:2]  # Print the first two rows

Out[19]: [['Store', 'Employees', 'Manager', 'Since', 'Flagship'],
          ['New York', 10.0, 'Sarah', 43301.0, False]]

pyxlsb currently offers no way of recognizing cells with dates, so you will have to manually convert values from date-formatted cells into datetime objects like so:

In [20]: from pyxlsb import convert_date
         convert_date(data[1][3])

Out[20]: datetime.datetime(2018, 7, 20, 0, 0)

Remember, when you read the xlsb file format with a pandas version below 1.3, you need to specify the engine explicitly:

In [21]: df = pd.read_excel("xl/stores.xlsb", engine="pyxlsb")

xlrd, xlwt, and xlutils

The combination of xlrd, xlwt, and xlutils offers roughly the same functionality for the legacy xls format that OpenPyXL offers for the xlsx format: xlrd reads, xlwt writes, and xlutils edits xls files. These packages aren't actively developed anymore, but they are likely going to be relevant as long as there are still xls files around.
xlutils is not part of Anaconda, so install it if you haven't already:

(base)> conda install xlutils

Let's get started with the reading part!

Reading with xlrd

The following sample code shows you how to read the values from an Excel workbook with xlrd:

In [22]: import xlrd
         import xlwt
         from xlwt.Utils import cell_to_rowcol2
         import xlutils
         import excel

In [23]: # Open the workbook to read cell values. The file is
         # automatically closed again after loading the data.
         book = xlrd.open_workbook("xl/stores.xls")

In [24]: # Get a list with all sheet names
         book.sheet_names()

Out[24]: ['2019', '2020', '2019-2020']

In [25]: # Loop through the sheet objects
         for sheet in book.sheets():
             print(sheet.name)

2019
2020
2019-2020

In [26]: # Get a sheet object by name or index (0-based)
         sheet = book.sheet_by_index(0)
         sheet = book.sheet_by_name("2019")

In [27]: # Dimensions
         sheet.nrows, sheet.ncols

Out[27]: (8, 6)

In [28]: # Read the value of a single cell
         # using "A1" notation and using cell indices (0-based).
         # The "*" unpacks the tuple that cell_to_rowcol2 returns
         # into individual arguments.
         sheet.cell(*cell_to_rowcol2("B3")).value
         sheet.cell(2, 1).value

Out[28]: 'New York'

In [29]: # Read in a range of cell values by using our excel module
         data = excel.read(sheet, "B2")
         data[:2]  # Print the first two rows

Out[29]: [['Store', 'Employees', 'Manager', 'Since', 'Flagship'],
          ['New York', 10.0, 'Sarah',
           datetime.datetime(2018, 7, 20, 0, 0), False]]

Used Range

Unlike OpenPyXL and pyxlsb, xlrd returns the dimensions of cells with a value, instead of the used range of a sheet, when using sheet.nrows and sheet.ncols. What Excel returns as the used range often contains empty rows and columns at the bottom and at the right border of the range.
This can, for example, happen when you delete the content of rows (by hitting the Delete key), rather than deleting the rows themselves (by right-clicking and selecting Delete).

Writing with xlwt

The following code reproduces what we have done previously with OpenPyXL and XlsxWriter, as shown in Figure 8-1. xlwt, however, cannot produce charts and only supports the bmp format for pictures:

In [30]: import xlwt
         from xlwt.Utils import cell_to_rowcol2
         import datetime as dt
         import excel

In [31]: # Instantiate a workbook
         book = xlwt.Workbook()

         # Add a sheet and give it a name
         sheet = book.add_sheet("Sheet1")

         # Writing individual cells using A1 notation
         # and cell indices (0-based)
         sheet.write(*cell_to_rowcol2("A1"), "Hello 1")
         sheet.write(r=1, c=0, label="Hello 2")

         # Formatting: fill color, alignment, border and font
         formatting = xlwt.easyxf("font: bold on, color red;"
                                  "align: horiz center;"
                                  "borders: top_color red, bottom_color red,"
                                  "right_color red, left_color red,"
                                  "left thin, right thin,"
                                  "top thin, bottom thin;"
                                  "pattern: pattern solid, fore_color yellow;")
         sheet.write(r=2, c=0, label="Hello 3", style=formatting)

         # Number formatting (using Excel's formatting strings)
         number_format = xlwt.easyxf(num_format_str="0.00")
         sheet.write(3, 0, 3.3333, number_format)

         # Date formatting (using Excel's formatting strings)
         date_format = xlwt.easyxf(num_format_str="mm/dd/yyyy")
         sheet.write(4, 0, dt.datetime(2012, 2, 3), date_format)

         # Formula: you must use the English name of the formula
         # with commas as delimiters
         sheet.write(5, 0, xlwt.Formula("SUM(A4, 2)"))

         # Two-dimensional list (we're using our excel module)
         data = [[None, "North", "South"],
                 ["Last Year", 2, 5],
                 ["This Year", 3, 6]]
         excel.write(sheet, data, "A10")

         # Picture (only allows to add bmp format)
         sheet.insert_bitmap("images/python.bmp", 0, 2)

         # This writes the file to disk
         book.save("xlwt.xls")

Editing with xlutils

xlutils acts as a bridge between xlrd and xlwt. This makes it explicit that this is not a true editing operation: the spreadsheet is read, including the formatting, via xlrd (by setting formatting_info=True) and then written out again by xlwt, including the changes that were made in between:

In [32]: import xlutils.copy

In [33]: book = xlrd.open_workbook("xl/stores.xls", formatting_info=True)
         book = xlutils.copy.copy(book)
         book.get_sheet(0).write(0, 0, "changed!")
         book.save("stores_edited.xls")

At this point, you know how to read and write an Excel workbook in a specific format. The next section moves on with a few advanced topics that include working with big Excel files and using pandas and the reader and writer packages together.

Advanced Reader and Writer Topics

If your files are bigger and more complex than the simple Excel files we used in the examples so far, relying on the default options may not be good enough anymore. Therefore, we start this section by looking at how to work with bigger files. Then, we'll learn how to use pandas together with the reader and writer packages: this will open up the ability to style your pandas DataFrames the way you want. To conclude this section, we will use everything we learned in this chapter to make the Excel report from last chapter's case study look much more professional.

Working with Big Excel Files

Working with big files can cause two issues: the reading and writing process may be slow, or your computer may run out of memory. Usually, the memory issue is of bigger concern, as it will cause your program to crash. When exactly a file is considered big always depends on the available resources on your system and your definition of slow. This section shows optimization techniques offered by the individual packages, allowing you to work with Excel files that push the limits. I'll start by looking at the options for the writer libraries, followed by the options for the reader libraries.
At the end of this section, I'll show you how to read the sheets of a workbook in parallel to reduce processing time.

Writing with OpenPyXL

When writing large files with OpenPyXL, make sure to have the lxml package installed, as this makes the writing process faster. It is included in Anaconda, so there's nothing you need to do about that. The critical option, though, is the write_only=True flag, which makes sure that the memory consumption remains low. It, however, forces you to write row by row by using the append method and won't allow you to write single cells anymore:

In [34]: book = openpyxl.Workbook(write_only=True)
         # With write_only=True, book.active doesn't work
         sheet = book.create_sheet()
         # This will produce a sheet with 1000 x 200 cells
         for row in range(1000):
             sheet.append(list(range(200)))
         book.save("openpyxl_optimized.xlsx")

Writing with XlsxWriter

XlsxWriter has an option similar to OpenPyXL's, called constant_memory. It forces you to write sequential rows, too. You enable the option by providing an options dictionary like this:

In [35]: book = xlsxwriter.Workbook("xlsxwriter_optimized.xlsx",
                                    options={"constant_memory": True})
         sheet = book.add_worksheet()
         # This will produce a sheet with 1000 x 200 cells
         for row in range(1000):
             sheet.write_row(row, 0, list(range(200)))
         book.close()

Reading with xlrd

When reading big files in the legacy xls format, xlrd allows you to load sheets on demand, like this:

In [36]: with xlrd.open_workbook("xl/stores.xls", on_demand=True) as book:
             sheet = book.sheet_by_index(0)  # Only loads the first sheet

If you don't use the workbook as a context manager as we do here, you need to call book.release_resources() manually to properly close the workbook again.
To use xlrd in this mode with pandas, use it like this:

In [37]: with xlrd.open_workbook("xl/stores.xls", on_demand=True) as book:
             with pd.ExcelFile(book, engine="xlrd") as f:
                 df = pd.read_excel(f, sheet_name=0)

Reading with OpenPyXL

To keep memory under control when reading big Excel files with OpenPyXL, you should load the workbook with read_only=True. Since OpenPyXL doesn't support the with statement, you will need to make sure to close the file again when you're done. If your file contains links to external workbooks, you may additionally want to use keep_links=False to make it faster. keep_links makes sure that the references to external workbooks are kept, which may unnecessarily slow down the process if you are only interested in reading the values of a workbook:

In [38]: book = openpyxl.load_workbook("xl/big.xlsx",
                                       data_only=True, read_only=True,
                                       keep_links=False)
         # Perform the desired read operations here
         book.close()  # Required with read_only=True

Reading sheets in parallel

When you use pandas' read_excel function to read in multiple sheets of a big workbook, you will find that this takes a long time (we'll get to a concrete example in a moment). The reason is that pandas reads sheets sequentially, i.e., one after another. To speed things up, you could read the sheets in parallel. While there is no easy way to parallelize the writing of workbooks due to how the files are structured internally, reading multiple sheets in parallel is simple enough. However, since parallelization is an advanced topic, I left it out of the Python introduction and won't go into details here either.

In Python, if you want to take advantage of the multiple CPU cores that every modern computer has, you use the multiprocessing package that is part of the standard library. This will spawn multiple Python interpreters (usually one per CPU core), which work on a task in parallel.
Instead of processing one sheet after another, you have one Python interpreter process the first sheet while, at the same time, a second Python interpreter is processing the second sheet, etc. However, every additional Python interpreter takes some time to start up and uses additional memory, so if you have small files, they will most likely run slower when you parallelize the reading process instead of faster. In the case of a big file with multiple big sheets, multiprocessing can speed up the process substantially, though—always assuming that your system has the required memory to handle the workload. If you run the Jupyter notebook on Binder as shown in Chapter 2, you won't have enough memory and hence, the parallelized version will run slower. In the companion repo, you will find parallel_pandas.py, which is a simple implementation for reading the sheets in parallel, using OpenPyXL as the engine. It's simple to use, so you won't need to know anything about multiprocessing:

import parallel_pandas

parallel_pandas.read_excel(filename, sheet_name=None)

By default, it will read in all sheets, but you can provide a list of sheet names that you want to process. Like pandas, the function returns a dictionary in the following form: {"sheetname": df}, i.e., the keys are the sheet names and the values are the DataFrames.

The %%time Magic Command

In the following samples, I am going to make use of the %%time cell magic. I introduced magic commands in Chapter 5 in connection with Matplotlib. %%time is a cell magic that can be very useful for simple performance tuning, as it makes it easy to compare the execution time of two cells with different code snippets. Wall time is the elapsed time from the start to the end of the program, i.e., the cell.
If you are on macOS or Linux, you will not just get the wall time but an additional line for CPU times along these lines:

CPU times: user 49.4 s, sys: 108 ms, total: 49.5 s

CPU times measures the time spent on the CPU, which can be lower than the wall time (if the program has to wait for the CPU to become available) or higher (if the program is running on multiple CPU cores in parallel). To measure the time more accurately, use %%timeit instead of %%time, which runs the cell multiple times and takes the average of all runs. %%time and %%timeit are cell magics, i.e., they need to be on the first line of the cell and will measure the execution time of the whole cell. If, instead, you want to measure just a single line, start that line with %time or %timeit.

Let's see how much faster the parallelized version reads the big.xlsx file that you will find in the companion repo's xl folder:

In [39]: %%time
         data = pd.read_excel("xl/big.xlsx", sheet_name=None,
                              engine="openpyxl")
Wall time: 49.5 s

In [40]: %%time
         import parallel_pandas
         data = parallel_pandas.read_excel("xl/big.xlsx", sheet_name=None)
Wall time: 12.1 s

To get the DataFrame that represents Sheet1, you would write data["Sheet1"] in both cases. Looking at the wall time of both samples, you'll see that the parallelized version was multiple times faster than pd.read_excel with this particular workbook and on my laptop with 6 CPU cores. If you want it even faster, parallelize OpenPyXL directly: you will also find an implementation for that in the companion repository (parallel_openpyxl.py), together with an implementation for xlrd to read the legacy xls format in parallel (parallel_xlrd.py).
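The %%time and %%timeit magics only exist in IPython and Jupyter. In a plain Python script, the standard library's timeit module provides comparable measurements; a small sketch (the timed expression is just an arbitrary stand-in):

```python
import timeit

# Run the snippet 10,000 times and report the total elapsed time,
# similar in spirit to what %%timeit does with a cell.
total = timeit.timeit("sum(range(1000))", number=10_000)
print(f"10,000 runs: {total:.4f} s in total")

# repeat() returns one total per round; the minimum is usually the
# most stable figure for comparing two implementations.
best = min(timeit.repeat("sum(range(1000))", number=10_000, repeat=3))
print(f"best of 3 rounds: {best:.4f} s")
```

Like the cell magics, this measures wall-clock time; pass a different timer (e.g., time.process_time) to timeit if you want CPU time instead.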
Going through the underlying packages instead of pandas will allow you to skip the transformation into a DataFrame or only apply the cleaning steps that you need, which will most likely help you make things faster—if that is your biggest concern.

Reading a Sheet in Parallel with Modin

If you are only reading in one huge sheet, it is worth looking at Modin, a project that acts as a drop-in replacement for pandas. It parallelizes the reading process of a single sheet and provides impressive speed improvements. Since Modin requires a specific version of pandas, it could downgrade the version that comes with Anaconda when you install it. If you want to test it, I would recommend that you create a separate Conda environment for this to ensure you are not messing up your base environment. See Appendix A for more detailed instructions on how to create a Conda environment:

(base)> conda create --name modin python=3.8 -y
(base)> conda activate modin
(modin)> conda install -c conda-forge modin -y

On my machine and using the big.xlsx file, running the following code took roughly five seconds while it took pandas about twelve seconds:

import modin.pandas
data = modin.pandas.read_excel("xl/big.xlsx",
                               sheet_name=0, engine="openpyxl")

Now that you know how to deal with big files, let's move on and see how we can use pandas and the low-level packages together to improve the default formatting when writing DataFrames to Excel files!

Formatting DataFrames in Excel

To format DataFrames in Excel the way we want, we can write code that uses pandas together with OpenPyXL or XlsxWriter. We'll first use this combination to add a title to the exported DataFrame. We'll then format a DataFrame's header and index before wrapping this section up by formatting the data part of a DataFrame.
Combining pandas with OpenPyXL for reading can also be occasionally useful, so let's start with this:

In [41]: with pd.ExcelFile("xl/stores.xlsx", engine="openpyxl") as xlfile:
             # Read a DataFrame
             df = pd.read_excel(xlfile, sheet_name="2020")

             # Get the OpenPyXL workbook object
             book = xlfile.book

             # From here on, it's OpenPyXL code
             sheet = book["2019"]
             value = sheet["B3"].value  # Read a single value

When writing workbooks, it works analogously, allowing us to easily add a title to our DataFrame report:

In [42]: with pd.ExcelWriter("pandas_and_openpyxl.xlsx",
                             engine="openpyxl") as writer:
             df = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": [5, 6, 7, 8]})
             # Write a DataFrame
             df.to_excel(writer, "Sheet1", startrow=4, startcol=2)

             # Get the OpenPyXL workbook and sheet objects
             book = writer.book
             sheet = writer.sheets["Sheet1"]

             # From here on, it's OpenPyXL code
             sheet["A1"].value = "This is a Title"  # Write a single cell value

These samples use OpenPyXL, but it works conceptually the same with the other packages. Let's now continue with finding out how we can format the index and header of a DataFrame.

Formatting a DataFrame's index and headers

The easiest way to get complete control over the formatting of the index and column headers is to simply write them yourself. The following sample shows you how to do this with OpenPyXL and XlsxWriter, respectively. You can see the output in Figure 8-2.
Let's start by creating a DataFrame:

In [43]: df = pd.DataFrame({"col1": [1, -2], "col2": [-3, 4]},
                           index=["row1", "row2"])
         df.index.name = "ix"
         df

Out[43]:       col1  col2
         ix
         row1     1    -3
         row2    -2     4

To format the index and headers with OpenPyXL, do as follows:

In [44]: from openpyxl.styles import PatternFill

In [45]: with pd.ExcelWriter("formatting_openpyxl.xlsx",
                             engine="openpyxl") as writer:
             # Write out the df with the default formatting to A1
             df.to_excel(writer, startrow=0, startcol=0)

             # Write out the df with custom index/header formatting to A6
             startrow, startcol = 0, 5
             # 1. Write out the data part of the DataFrame
             df.to_excel(writer, header=False, index=False,
                         startrow=startrow + 1, startcol=startcol + 1)
             # Get the sheet object and create a style object
             sheet = writer.sheets["Sheet1"]
             style = PatternFill(fgColor="D9D9D9", fill_type="solid")

             # 2. Write out the styled column headers
             for i, col in enumerate(df.columns):
                 sheet.cell(row=startrow + 1, column=i + startcol + 2,
                            value=col).fill = style

             # 3. Write out the styled index
             index = [df.index.name if df.index.name else None] + list(df.index)
             for i, row in enumerate(index):
                 sheet.cell(row=i + startrow + 1, column=startcol + 1,
                            value=row).fill = style

To format the index and headers with XlsxWriter instead, you'll need to adjust the code slightly:

In [46]: # Formatting index/headers with XlsxWriter
         with pd.ExcelWriter("formatting_xlsxwriter.xlsx",
                             engine="xlsxwriter") as writer:
             # Write out the df with the default formatting to A1
             df.to_excel(writer, startrow=0, startcol=0)

             # Write out the df with custom index/header formatting to A6
             startrow, startcol = 0, 5
             # 1. Write out the data part of the DataFrame
             df.to_excel(writer, header=False, index=False,
                         startrow=startrow + 1, startcol=startcol + 1)
             # Get the book and sheet object and create a style object
             book = writer.book
             sheet = writer.sheets["Sheet1"]
             style = book.add_format({"bg_color": "#D9D9D9"})

             # 2. Write out the styled column headers
             for i, col in enumerate(df.columns):
                 sheet.write(startrow, startcol + i + 1, col, style)

             # 3. Write out the styled index
             index = [df.index.name if df.index.name else None] + list(df.index)
             for i, row in enumerate(index):
                 sheet.write(startrow + i, startcol, row, style)

With the index and header formatted, let's see how we can style the data part!

Figure 8-2. A DataFrame with the default format (left) and with a custom format (right)

Formatting a DataFrame's data part

The possibilities you have to format the data part of a DataFrame depend on the package you're using: if you use pandas' to_excel method, OpenPyXL can apply a format to each cell, while XlsxWriter can only apply formats on a row or column basis. For example, to set the number format of the cells to three decimals and center-align the content as shown in Figure 8-3, do the following with OpenPyXL:

In [47]: from openpyxl.styles import Alignment

In [48]: with pd.ExcelWriter("data_format_openpyxl.xlsx",
                             engine="openpyxl") as writer:
             # Write out the DataFrame
             df.to_excel(writer)

             # Get the book and sheet objects
             book = writer.book
             sheet = writer.sheets["Sheet1"]

             # Formatting individual cells
             nrows, ncols = df.shape
             for row in range(nrows):
                 for col in range(ncols):
                     # +1 to account for the header/index
                     # +1 since OpenPyXL is 1-based
                     cell = sheet.cell(row=row + 2, column=col + 2)
                     cell.number_format = "0.000"
                     cell.alignment = Alignment(horizontal="center")

For XlsxWriter, adjust the code as follows:

In [49]: with pd.ExcelWriter("data_format_xlsxwriter.xlsx",
                             engine="xlsxwriter") as writer:
             # Write out the DataFrame
             df.to_excel(writer)

             # Get the book and sheet objects
             book = writer.book
             sheet = writer.sheets["Sheet1"]

             # Formatting the columns (individual cells can't be formatted)
             number_format = book.add_format({"num_format": "0.000",
                                              "align": "center"})
             sheet.set_column(first_col=1, last_col=2,
                              cell_format=number_format)
Figure 8-3. A DataFrame with a formatted data part

As an alternative, pandas offers experimental support for the style property of DataFrames. Experimental means that the syntax can change at any point in time. Since styles were introduced to format the DataFrames in HTML format, they use CSS syntax. CSS stands for cascading style sheets and is used to define the style of HTML elements. To apply the same format as in the previous example (three decimals and center align), you'll need to apply a function to every element of a Styler object via applymap. You get a Styler object via the df.style attribute:

In [50]: df.style.applymap(lambda x: "number-format: 0.000;"
                                     "text-align: center")\
             .to_excel("styled.xlsx")

The outcome of this code is the same as shown in Figure 8-3. For more details on the DataFrame style approach, please refer directly to the styling docs.

Without having to rely on the style attribute, pandas offers support to format the date and datetime objects as shown in Figure 8-4:

In [51]: df = pd.DataFrame({"Date": [dt.date(2020, 1, 1)],
                            "Datetime": [dt.datetime(2020, 1, 1, 10)]})
         with pd.ExcelWriter("date.xlsx",
                             date_format="yyyy-mm-dd",
                             datetime_format="yyyy-mm-dd hh:mm:ss") as writer:
             df.to_excel(writer)

Figure 8-4. A DataFrame with formatted dates

Other Reader and Writer Packages

Apart from the packages that we have looked at in this chapter, there are a few others that may be interesting for specific use cases:

pyexcel
    pyexcel offers a harmonized syntax across different Excel packages and other file formats including CSV files and OpenOffice files.

PyExcelerate
    The goal of PyExcelerate is to write Excel files in the fastest possible way.

pylightxl
    pylightxl can read xlsx and xlsm files and write xlsx files.

styleframe
    styleframe wraps pandas and OpenPyXL to produce Excel files with nicely formatted DataFrames.

oletools
    oletools is not a classic reader or writer package but can be used to analyze Microsoft Office documents, e.g., for malware analysis. It offers a convenient way to extract VBA code from Excel workbooks.

Now that you know how to format DataFrames in Excel, it's time to take another stab at the case study from the previous chapter and see if we can improve the Excel report with the knowledge of this chapter!

Case Study (Revisited): Excel Reporting

Having made it to the end of this chapter, you know enough to be able to go back to the Excel report from last chapter's case study and make it visually more appealing. If you like, go back to sales_report_pandas.py in the companion repository and try to turn it into the report as shown in Figure 8-5.

The red numbers are sales figures that are below 20,000. I haven't touched every aspect of formatting in this chapter (like how to apply conditional formatting), so you will have to use the documentation of the package you choose to work with. To compare your solution, I have included two versions of the script that produce this report in the companion repo. The first version is based on OpenPyXL (sales_report_openpyxl.py) and the other one is based on XlsxWriter (sales_report_xlsxwriter.py).
Seeing the scripts side-by-side may also allow you to make a more educated decision about which package you want to pick for your next writer task. We will get back to this case study one more time in the next chapter: there, we'll rely on an installation of Microsoft Excel to work with report templates.

Figure 8-5. The revisited sales report as created by sales_report_openpyxl.py

Conclusion

In this chapter, I introduced you to the reader and writer packages that pandas uses under the hood. Using them directly allows us to read and write Excel workbooks without needing to have pandas installed. However, using them in combination with pandas enables us to enhance the Excel DataFrame reports by adding titles, charts, and formatting. While the current reader and writer packages are incredibly powerful, I still hope that we'll see a "NumPy moment" one day that unites the efforts of all the developers into a single project. It would be great to know which package to use without having to look at a table first and without having to use a different syntax for each type of Excel file. In that sense, it makes sense to start with pandas and only fall back to the reader and writer packages when you need additional functionality that pandas doesn't cover.

Excel, however, is so much more than just a data file or a report: the Excel application is one of the most intuitive user interfaces where users can type in a few numbers and get it to display the information they are looking for. Automating the Excel application instead of reading and writing Excel files opens up a whole new range of functionality that we are going to explore in Part IV.
The next chapter starts this journey by showing you how to control Excel from Python remotely.

PART IV
Programming the Excel Application with xlwings

CHAPTER 9
Excel Automation

So far, we've learned how to replace typical Excel tasks with pandas (Part II) and how to use Excel files both as a data source as well as a file format for your reports (Part III). This chapter kicks off Part IV, where we switch away from manipulating Excel files with the reader and writer packages and begin automating the Excel application with xlwings.

The main use case of xlwings is to build interactive applications where Excel spreadsheets act as the user interface, allowing you to call Python by clicking a button or calling a user-defined function—that's the type of functionality that isn't covered by the reader and writer packages. But that doesn't mean that xlwings can't be used to read and write files, as long as you are on either macOS or Windows and have Excel installed. One advantage that xlwings has in this area is the ability to truly edit Excel files, in all formats, without changing or losing any of the existing content or formatting. Another advantage is that you can read the cell values from an Excel workbook without the need to save it first. It can, however, also make perfect sense to use an Excel reader/writer package and xlwings together, as we will see when we pick up the reporting case study from Chapter 7 one more time.

I'll start this chapter by introducing you to the Excel object model as well as xlwings: we'll first learn the basics like connecting to a workbook or reading and writing cell values before digging a bit deeper to understand how converters and options allow us to work with pandas DataFrames and NumPy arrays.
We also look at how to interact with charts, pictures, and defined names before moving on to the last section, which explains how xlwings works under the hood: this will give you the required knowledge to make your scripts performant as well as work around missing functionality.

From this chapter on, you will need to run the code samples on either Windows or macOS, as they depend on a local installation of Microsoft Excel.1

1 On Windows, you need at least Excel 2007, and on macOS, you need at least Excel 2016. Alternatively, you can install the desktop version of Excel, which is part of your Microsoft 365 subscription. Check your subscription for details on how to do this.

Getting Started with xlwings

One goal of xlwings is to serve as a drop-in replacement for VBA, allowing you to interact with Excel from Python on Windows and macOS. Since Excel's grid is the perfect layout to display Python's data structures like nested lists, NumPy arrays, and pandas DataFrames, one of xlwings' core features is to make reading and writing them from and to Excel as easy as possible. I'll start this section by introducing you to Excel as a data viewer—this is useful when you are interacting with DataFrames in a Jupyter notebook. I'll then explain the Excel object model before exploring it interactively with xlwings. To wrap this section up, I'll show you how to call VBA code that you may still have in legacy workbooks. Since xlwings is part of Anaconda, we don't need to install it manually.

Using Excel as Data Viewer

You probably noticed in the previous chapters that by default, Jupyter notebooks hide the majority of data for bigger DataFrames and only show the top and bottom rows as well as the first and last few columns. One way to get a better feeling for your data is to plot it—this allows you to spot outliers or other irregularities. Sometimes, however, it's just really helpful to be able to scroll through a data table.
After reading Chapter 7, you know how to use the to_excel method on your DataFrame. While this works, it can be a bit cumbersome: you need to give the Excel file a name, find it on the file system, open it, and, after making changes to your DataFrame, you need to close the Excel file and run the whole process again. A better idea may be to run df.to_clipboard(), which copies the DataFrame df to the clipboard, allowing you to paste it into Excel, but there is an even simpler way—use the view function that comes with xlwings:

In [1]: # First, let's import the packages that we'll use in this chapter
        import datetime as dt
        import xlwings as xw
        import pandas as pd
        import numpy as np

184 | Chapter 9: Excel Automation

2 Note that xlwings 0.22.0 introduced the xw.load function, which is similar to xw.view, but works in the opposite direction: it allows you to load an Excel range easily into a Jupyter notebook as a pandas DataFrame, see the docs.

In [2]: # Let's create a DataFrame based on pseudorandom numbers and
        # with enough rows that only the head and tail are shown
        df = pd.DataFrame(data=np.random.randn(100, 5),
                          columns=[f"Trial {i}" for i in range(1, 6)])
        df

Out[2]:      Trial 1   Trial 2   Trial 3   Trial 4   Trial 5
        0  -1.313877  1.164258 -1.306419 -0.529533 -0.524978
        1  -0.854415  0.022859 -0.246443 -0.229146 -0.005493
        2  -0.327510 -0.492201 -1.353566 -1.229236  0.024385
        3  -0.728083 -0.080525  0.628288 -0.382586 -0.590157
        4  -1.227684  0.498541 -0.266466  0.297261 -1.297985
        ..       ...       ...       ...       ...       ...
        95 -0.903446  1.103650  0.033915  0.336871  0.345999
        96 -1.354898 -1.290954 -0.738396 -1.102659  0.115076
        97 -0.070092 -0.416991 -0.203445 -0.686915 -1.163205
        98 -1.201963  0.471854 -0.458501 -0.357171  1.954585
        99  1.863610  0.214047 -1.426806  0.751906 -2.338352

        [100 rows x 5 columns]

In [3]: # View the DataFrame in Excel
        xw.view(df)

The view function accepts all common Python objects, including numbers, strings, lists, dictionaries, tuples, NumPy arrays, and pandas DataFrames.
By default, it opens a new workbook and pastes the object into cell A1 of the first sheet—it even adjusts the column widths by using Excel's AutoFit functionality. Instead of opening a new workbook every time, you can also reuse the same one by providing the view function an xlwings sheet object as the second argument: xw.view(df, mysheet). How you get access to such a sheet object and how it fits into the Excel object model is what I will explain next.2

Getting Started with xlwings | 185

macOS: Permissions and Preferences

On macOS, make sure to run Jupyter notebooks and VS Code from an Anaconda Prompt (i.e., via Terminal) as shown in Chapter 2. This ensures that you will be greeted by two pop-ups when you use xlwings for the first time: the first one is "Terminal wants access to control System Events" and the second one is "Terminal wants access to control Microsoft Excel." You will need to confirm both pop-ups to allow Python to automate Excel. In theory, these pop-ups should be triggered by any application from which you run xlwings code, but in practice, that's often not the case, so running them via the Terminal will keep you out of trouble. Also, you'll need to open Excel's Preferences and uncheck "Show Workbook Gallery when opening Excel" under the General category. This opens Excel directly on an empty workbook instead of opening the gallery first, which would get in your way when you open a new Excel instance via xlwings.

The Excel Object Model

When you work with Excel programmatically, you interact with its components like a workbook or a sheet. These components are organized in the Excel object model, a hierarchical structure that represents Excel's graphical user interface (see Figure 9-1). Microsoft largely uses the same object model with all programming languages they officially support, whether that's VBA, Office Scripts (the JavaScript interface for Excel on the web), or C#.
In contrast to the reader and writer packages from Chapter 8, xlwings follows the Excel object model very closely, only with a breath of fresh air: for example, xlwings uses the names app instead of application and book instead of workbook:

• An app contains the books collection
• A book contains the sheets collection
• A sheet gives access to range objects and collections such as charts
• A range contains one or more contiguous cells as its items

The dashed boxes are collections and contain one or more objects of the same type. An app corresponds to an Excel instance, i.e., an Excel application that runs as a separate process. Power users sometimes use multiple Excel instances in parallel to open the same workbook twice, for example, to calculate a workbook with different inputs in parallel. With the more recent versions of Excel, Microsoft made it slightly more complicated to open multiple instances of Excel manually: start Excel, then right-click on its icon in the Windows taskbar. In the appearing menu, left-click on the Excel entry while holding down the Alt key at the same time (make sure to keep the Alt key pressed until after you release your mouse button)—a pop-up will ask if you want to start a new instance of Excel. On macOS, there is no manual way of launching more than one instance of the same program but you can launch multiple Excel instances programmatically via xlwings, as we will see later. To summarize, an Excel instance is a sandboxed environment, which means that one instance can't communicate with the other one.3 The sheet object gives you access to collections like charts, pictures, and defined names—topics that we will look into in the second section of this chapter.

3 See "What are Excel instances, and why is this important?" for more information about separate Excel instances.

Figure 9-1. The Excel object model as implemented by xlwings (excerpt)

Language and Regional Settings

This book is based on the US-English version of Excel. I will occasionally refer to default names like "Book1" or "Sheet1," which will be different if you use Excel in another language. For example, "Sheet1" is called "Feuille1" in French and "Hoja1" in Spanish. Also, the list separator, which is the separator that Excel uses in cell formulas, depends on your settings: I will be using the comma, but your version may require a semicolon or another character. For example, instead of writing =SUM(A1, A2), you will need to write =SUMME(A1; A2) on a computer with German regional settings.

On Windows, if you wanted to change the list separator from a semicolon to a comma, you need to change it outside of Excel via your Windows settings: click on the Windows start button, search for Settings (or click the cog icon), then go to "Time & Language" > "Region & language" > "Additional date, time & regional settings" where you finally click on "Region" > "Change location." Under "List separator," you will be able to change it from a semicolon to a comma. Bear in mind that this only works if your "Decimal symbol" (in the same menu) is not also a comma.
To override the system-wide decimal and thousands separators (but not the list separator), in Excel go to "Options" > "Advanced," where you will find the settings under "Editing Options." On macOS, it works similarly, except that you can't change the list separator directly: under System Preferences of your macOS (not Excel), select Language & Region. There, set a specific region either globally (under the General tab) or specifically for Excel (under the Apps tab).

…to the cloud, where practically unconstrained computing resources are waiting for you.

Why I Wrote This Book

Through my work on xlwings, the Excel automation package that we will meet in Part IV of this book, I am in close contact with many users who use Python for Excel—whether that's via the issue tracker on GitHub, a question on StackOverflow or at a physical event like a meetup or a conference.

On a regular basis, I am asked to recommend resources to get started with Python. While there is certainly no shortage of Python introductions, they are often either too general (nothing about data analysis) or too specific (full scientific introductions). However, Excel users tend to be somewhere in the middle: they certainly work with data, but a full scientific introduction may be too technical. They also often have specific requirements and questions that aren't answered in any of the existing material. Some of these questions are:

• Which Python-Excel package do I need for which task?
• How do I move my Power Query database connection over to Python?
• What's the equivalent of Excel's AutoFilter or pivot table in Python?

I wrote this book to get you from zero Python knowledge to be able to automate your Excel-centric tasks and leverage Python's data analysis and scientific computing tools in Excel without any detours.

Who This Book Is For

If you are an advanced Excel user who wants to beat the limits of Excel with a modern programming language, this book is for you. Most typically, this means that you spend hours every month downloading, cleaning, and copy/pasting big amounts of data into mission-critical spreadsheets. While there are different ways to overcome Excel's limits, this book will focus on how to use Python for this task.

You should have a basic understanding of programming: it helps if you have already written a function or a for loop (no matter in which programming language) and have an idea about what an integer or a string is. You might even be able to master this book if you are used to writing complex cell formulas or have experience with tweaking recorded VBA macros. You are not expected to have any Python-specific experience, though, as there are introductions to all the tools that we will use including an introduction to Python itself.

xii | Preface

If you are a seasoned VBA developer, you will find regular comparisons between Python and VBA that will allow you to ship around the common gotchas and hit the ground running.

This book can also be helpful if you are a Python developer and need to learn about the different ways that Python can deal with the Excel application and Excel files to be able to pick the right package given the requirements of your business users.

How This Book Is Organized

In this book, I will show you all aspects of the Python for Excel story split into four parts:

Part I: Introduction to Python
    This part starts by looking into the reasons why Python is such an enjoyable companion for Excel before introducing the tools we'll be using in this book: the Anaconda Python distribution, Visual Studio Code, and Jupyter notebooks. This part will also teach you enough Python to be able to master the rest of this book.

Part II: Introduction to pandas
    pandas is Python's go-to library for data analysis. We will learn how to replace Excel workbooks with a combination of Jupyter notebooks and pandas. Usually, pandas code is both easier to maintain and more efficient than an Excel workbook, and you can work with datasets that don't fit into a spreadsheet. Unlike Excel, pandas allows you to run your code wherever you want, including the cloud.

Part III: Reading and Writing Excel Files without Excel
    This part is about manipulating Excel files by using one of the following Python packages: pandas, OpenPyXL, XlsxWriter, pyxlsb, xlrd, and xlwt. These packages are able to read and write Excel workbooks directly on disk and as such replace the Excel application: as you don't require an installation of Excel, they work on any platform that Python supports, including Windows, macOS, and Linux. A typical use case for a reader package is to read in data from Excel files that you receive every morning from an external company or system and store their contents in a database. A typical use case for a writer package is to provide the functionality behind the famous "Export to Excel" button that you find in almost every application.

Part IV: Programming the Excel Application with xlwings
    In this part, we'll see how we can use Python with the xlwings package to automate the Excel application rather than reading and writing Excel files on disk. Therefore, this part requires you to have a local installation of Excel. We will learn how to open Excel workbooks and manipulate them in front of our eyes. In addition to reading and writing files via Excel, we will build interactive Excel tools: these allow us to click a button to have Python perform something that you may have done previously with VBA macros, such as a computationally expensive calculation. We'll also learn how to write user-defined functions1 (UDFs) in Python instead of VBA.

1 Microsoft has started to use the term custom functions instead of UDFs. In this book, I will continue to call them UDFs.

It's important to understand the fundamental difference between reading and writing Excel files (Part III) and programming the Excel application (Part IV) as visualized in Figure P-1.

Figure P-1. Reading and writing Excel files (Part III) vs. programming Excel (Part IV)

Since Part III doesn't require an installation of Excel, everything works on all platforms that Python supports, mainly Windows, macOS, and Linux. Part IV, however, will only work on those platforms that Microsoft Excel supports, i.e., Windows and macOS, as the code relies on a local installation of Microsoft Excel.

Python and Excel Versions

This book is based on Python 3.8, which is the Python version that comes with the latest version of the Anaconda Python distribution at the time of this writing. If you want to use a newer version of Python, follow the instructions on the book's home page (https://xlwings.org/book), but make sure that you don't use an older version. I will occasionally make a comment if something changes with Python 3.9.

This book also expects you to use a modern version of Excel, meaning at least Excel 2007 on Windows and Excel 2016 on macOS. The locally installed version of Excel that comes with the Microsoft 365 subscription will also work perfectly—in fact, I even recommend it, as it has the latest features that you won't find in other versions of Excel. It was also the version I used to write this book, so if you use another version of Excel, you might sometimes see a small difference in the name or location of a menu item.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
    Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
    Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
    Shows commands or other text that should be typed literally by the user.

Constant width italic
    Shows text that should be replaced with user-supplied values or by values determined by context.

This element signifies a tip or suggestion.

This element signifies a general note.

This element indicates a warning or caution.

Using Code Examples

I am maintaining a web page with additional information to help you with this book. Make sure to check it out, especially if you run into an issue.

Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/fzumstein/python-for-excel. To download this companion repository, click on the…

To get a feeling for the Excel object model, as usual, it's best to play around with it interactively. Let's start with the Book class: it allows you to create new workbooks and connect to existing ones; see Table 9-1 for an overview.

Table 9-1. Working with Excel workbooks

xw.Book()
    Returns a book object representing a new Excel workbook in the active Excel instance.
If there is no active instance, Excel will be started.xw.Book("Book1") Returns a book object representing an unsaved workbook with the nameBook1 (name without file extension).xw.Book("Book1.xlsx") Returns a book object representing a previously saved workbook with thename Book1.xlsx (name with file extension). The file has to be either open orin the current working directory.xw.Book(r"C:\path\Book1.xlsx") Returns a book object of a previously saved workbook (full file path). The filecan be open or closed. The leading r turns the string into a raw string so thebackslashes (\) of the path are interpreted literally on Windows (I introducedraw strings in Chapter 5). On macOS, the r isn’t required as file paths useforward slashes instead of backslashes.xw.books.active Returns a book object representing the active workbook in the active Excelinstance.Let’s see how we can walk through the object model hierarchy from the book objectdown to the range object:In [4]: # Create a new empty workbook and print its name. This is the # book we will use to run most of the code samples in this chapter. book = xw.Book() book.nameOut[4]: 'Book2'In [5]: # Accessing the sheets collection book.sheetsOut[5]: Sheets([])In [6]: # Get a sheet object by index or name. You will need to adjust # "Sheet1" if your sheet is called differently.188 | Chapter 9: Excel Automation sheet1 = book.sheets[0] sheet1 = book.sheets["Sheet1"]In [7]: sheet1.range("A1")Out[7]: With the range object, we have arrived at the bottom of the hierarchy. The string thatgets printed in between angle brackets gives you useful information about that object,but to do something, you usually use the object with an attribute, as the next sampleshows:In [8]: # Most common tasks: write values... 
        sheet1.range("A1").value = [[1, 2], [3, 4]]
        sheet1.range("A4").value = "Hello!"

In [9]: # ...and read values
        sheet1.range("A1:B2").value

Out[9]: [[1.0, 2.0], [3.0, 4.0]]

In [10]: sheet1.range("A4").value

Out[10]: 'Hello!'

As you can see, by default, the value attribute of an xlwings range object accepts and returns a nested list for two-dimensional ranges and a scalar for a single cell. Everything we’ve used so far is almost identical to VBA: assuming that book is a VBA or xlwings workbook object, respectively, this is how you access the value attribute from cells A1 to B2 in VBA and with xlwings:

book.Sheets(1).Range("A1:B2").Value  # VBA
book.sheets[0].range("A1:B2").value  # xlwings

The differences are:

Attributes
    Python uses lowercase letters, potentially with underscores as suggested by PEP 8, Python’s style guide that I introduced in Chapter 3.

Indexing
    Python uses square brackets and zero-based indices to access an element in the sheets collection.

Table 9-2 gives you an overview of the strings that an xlwings range accepts.

Table 9-2. Strings to define a range in A1 notation

"A1"       A single cell
"A1:B2"    Cells from A1 to B2
"A:A"      Column A
"A:B"      Columns A to B
"1:1"      Row 1
"1:2"      Rows 1 to 2

Indexing and slicing work with xlwings range objects—watch the address in between angle brackets (the printed object representation) to see what cell range you end up with:

In [11]: # Indexing
         sheet1.range("A1:B2")[0, 0]

Out[11]: <Range [Book2]Sheet1!$A$1>

In [12]: # Slicing
         sheet1.range("A1:B2")[:, 1]

Out[12]: <Range [Book2]Sheet1!$B$1:$B$2>

Indexing corresponds to using the Cells property in VBA:

book.Sheets(1).Range("A1:B2").Cells(1, 1)  # VBA
book.sheets[0].range("A1:B2")[0, 0]  # xlwings

Instead of using range explicitly as an attribute of the sheet object, you can also get a range object by indexing and slicing the sheet object.
Using this with A1 notation will allow you to type less, and using it with integer indices makes the Excel sheet feel like a NumPy array:

In [13]: # Single cell: A1 notation
         sheet1["A1"]

Out[13]: <Range [Book2]Sheet1!$A$1>

In [14]: # Multiple cells: A1 notation
         sheet1["A1:B2"]

Out[14]: <Range [Book2]Sheet1!$A$1:$B$2>

In [15]: # Single cell: indexing
         sheet1[0, 0]

Out[15]: <Range [Book2]Sheet1!$A$1>

In [16]: # Multiple cells: slicing
         sheet1[:2, :2]

Out[16]: <Range [Book2]Sheet1!$A$1:$B$2>

Sometimes, however, it may be more intuitive to define a range by referring to the top-left and bottom-right cell of a range. The following samples refer to the cell ranges D10 and D10:F11, respectively, allowing you to understand the difference between indexing/slicing a sheet object and working with a range object:

In [17]: # D10 via sheet indexing
         sheet1[9, 3]

Out[17]: <Range [Book2]Sheet1!$D$10>

In [18]: # D10 via range object
         sheet1.range((10, 4))

Out[18]: <Range [Book2]Sheet1!$D$10>

In [19]: # D10:F11 via sheet slicing
         sheet1[9:11, 3:6]

Out[19]: <Range [Book2]Sheet1!$D$10:$F$11>

In [20]: # D10:F11 via range object
         sheet1.range((10, 4), (11, 6))

Out[20]: <Range [Book2]Sheet1!$D$10:$F$11>

Defining range objects with tuples is very similar to how the Cells property works in VBA, as the following comparison shows—this assumes again that book is either a VBA workbook object or an xlwings book object. Let’s first look at the VBA version:

With book.Sheets(1)
    myrange = .Range(.Cells(10, 4), .Cells(11, 6))
End With

This is equivalent to the following xlwings expression:

myrange = book.sheets[0].range((10, 4), (11, 6))

Zero vs. One-Based Indices

As a Python package, xlwings consistently uses zero-based indexing whenever you access elements via Python’s index or slice syntax, i.e., via square brackets. xlwings range objects, however, use Excel’s one-based row and column indices. Having the same row/column indices as Excel’s user interface may sometimes be beneficial. If you prefer to only ever use Python’s zero-based indexing, simply use the sheet[row_selection, column_selection] syntax.

The following sample shows you how to get from a range object (sheet1["A1"]) all the way up again to the app object. Remember that the app object represents an Excel instance (the output in between angle brackets represents Excel’s process ID and will therefore be different on your machine):

In [21]: sheet1["A1"].sheet.book.app

Out[21]: <Excel App 1234>

Having arrived at the very top of the Excel object model, it’s a good moment to see how you can work with multiple Excel instances. You will need to use the app object explicitly if you want to open the same workbook in multiple Excel instances or if you specifically want to distribute your workbooks across different instances for performance reasons. Another common use case for working with an app object is to open your workbook in a hidden Excel instance: this allows you to run an xlwings script in the background without blocking you from doing other work in Excel in the meantime:

In [22]: # Get one app object from the open workbook
         # and create an additional invisible app instance
         visible_app = sheet1.book.app
         invisible_app = xw.App(visible=False)

In [23]: # List the
Youcan put the folder anywhere you want, but I will refer to it occasionally as follows inthis book:C:\Users\username\python-for-excelBy simply downloading and unzipping the ZIP file on Windows, you’ll end up with afolder structure similar to this one (note the repeated folder names):C:\...\Downloads\python-for-excel-1st-edition\python-for-excel-1st-editionCopying the contents of this folder into one you create under C:\Users\\python-for-excel might make it easier for you to follow along. The same remarks aretrue for macOS, i.e., copy the files to /Users//python-for-excel.If you have a technical question or a problem using the code examples, please send anemail to bookquestions@oreilly.com.This book is here to help you get your job done. In general, if example code is offeredwith this book, you may use it in your programs and documentation. You do notneed to contact us for permission unless you’re reproducing a significant portion ofthe code. For example, writing a program that uses several chunks of code from thisbook does not require permission. Selling or distributing examples from O’Reillybooks does require permission. Answering a question by citing this book and quotingexample code does not require permission. Incorporating a significant amount ofexample code from this book into your product’s documentation does require per‐mission.We appreciate, but generally do not require, attribution. An attribution usuallyincludes the title, author, publisher, and ISBN. For example: “Python for Excel by FelixZumstein (O’Reilly). 
Copyright 2021 Zoomer Analytics LLC, 978-1-492-08100-5.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

O’Reilly Online Learning

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit http://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/py4excel.

Email bookquestions@oreilly.com to comment or ask technical questions about this book.

For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.

Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia

Acknowledgments

As a first-time author, I am incredibly grateful for the help I got from so many people along the way—they made this journey a lot easier for me!

At O’Reilly, I would like to thank my editor, Melissa Potter, who did a great job in keeping me motivated and on schedule and who helped me to bring this book into a readable form.
I’d also like to thank Michelle Smith, who worked with me on the initial book proposal, and Daniel Elfanbaum, who never tired of answering my technical questions.

A big thank you goes to all my colleagues, friends, and clients who invested many hours in reading the earliest forms of my drafts. Their feedback was crucial to making the book easier to understand, and some of the case studies are inspired by real-world Excel problems that they shared with me. My thanks go to Adam Rodriguez, Mano Beeslar, Simon Schiegg, Rui Da Costa, Jürg Nager, and Christophe de Montrichard.

I also got helpful feedback from readers of the Early Release version that was published on the O’Reilly online learning platform. Thank you Felipe Maion, Ray Doue, Kolyu Minevski, Scott Drummond, Volker Roth, and David Ruggles!

I was very lucky that the book got reviewed by highly qualified tech reviewers, and I really appreciate the hard work they put in under a lot of time pressure. Thanks for all your help, Jordan Goldmeier, George Mount, Andreas Clenow, Werner Brönnimann, and Eric Moreira!

Special thanks go to Björn Stiel, who wasn’t just a tech reviewer, but from whom I also learned many of the things I am writing about in this book. I’ve enjoyed working with you these past few years!

Last but not least, I’d like to extend my gratitude to Eric Reynolds, who merged his ExcelPython project into the xlwings code base in 2016. He also redesigned the whole package from scratch, making my horrible API from the early days a thing of the past. Thank you very much!

PART I
Introduction to Python

CHAPTER 1
Why Python for Excel?

Usually, Excel users start to question their spreadsheet tools when they hit a limitation. A classic example is when Excel workbooks contain so much data and formulas that they become slow or, in the worst case, crash. It does make sense, though, to question your setup before things go south: if you work on mission-critical workbooks where errors can result in financial or reputational damage, or if you spend hours every day updating Excel workbooks manually, you should learn how to automate your processes with a programming language. Automation takes out the risk of human error and allows you to spend your time on more productive tasks than copy/pasting data into an Excel spreadsheet.

In this chapter, I will give you a few reasons why Python is an excellent choice in combination with Excel and what its advantages are compared to Excel’s built-in automation language, VBA. After introducing Excel as a programming language and understanding its particularities, I will point out the specific features that make Python so much stronger in comparison with VBA. To start with, however, let’s take a quick look at the origins of our two main characters!

In terms of computer technology, Excel and Python have both been around for a very long time: Excel was first launched in 1985 by Microsoft, and, as may come as a surprise to many, it was only available for Apple Macintosh. It wasn’t until 1987 that Microsoft Windows got its first version in the form of Excel 2.0. Microsoft wasn’t the first player in the spreadsheet market, though: VisiCorp came out with VisiCalc in 1979, followed by Lotus Software in 1983 with Lotus 1-2-3. And Microsoft didn’t lead with Excel: three years earlier, they released Multiplan, a spreadsheet program that could be used on MS-DOS and a few other operating systems, but not on Windows.

Python was born in 1991, only six years after Excel. While Excel became popular early on, it took Python a bit longer until it got adopted in certain areas like web development or system administration.
In 2005, Python started to become a serious alternative for scientific computing when NumPy, a package for array-based computing and linear algebra, was first released. NumPy combined two predecessor packages and therefore streamlined all development efforts around scientific computing into a single project. Today, it forms the basis of countless scientific packages, including pandas, which came out in 2008 and which is largely responsible for the widespread adoption of Python in the world of data science and finance that started to happen after 2010. Thanks to pandas, Python, alongside R, has become one of the most commonly used languages for data science tasks like data analysis, statistics, and machine learning.

The fact that Python and Excel were both invented a long time ago isn’t the only thing they have in common: Excel and Python are also both a programming language. While you are probably not surprised to hear that about Python, it may require an explanation for Excel, which I’ll give you next.

Excel Is a Programming Language

This section starts by introducing Excel as a programming language, which will help you to understand why spreadsheet issues turn up in the news on a regular basis. We’ll then have a look at a few best practices that have emerged in the software development community and that can save you from many typical Excel errors. We’ll conclude with a brief introduction to Power Query and Power Pivot, two modern Excel tools that cover the sort of functionality for which we will use pandas instead.

If you use Excel for more than your grocery list, you are definitely using functions like =SUM(A1:A4) to sum up a range of cells. If you think for a moment about how this works, you will notice that the value of a cell usually depends on one or more other cells, which may again use functions that depend on one or more other cells, and so on.

[1] You can read the announcement of lambda functions on the Excel Blog.
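This chain of dependent cells can be sketched in plain Python. The following is a toy model for illustration only, not how Excel evaluates cells internally and not code from the book:

```python
# A toy "spreadsheet": each cell holds either a plain value or a formula
# (a function of other cells), mirroring =SUM(A1:A4) in Excel.
cells = {"A1": 10, "A2": 20, "A3": 30, "A4": 40}

def get(ref):
    """Resolve a cell reference: evaluate formulas, return plain values as-is."""
    value = cells[ref]
    return value() if callable(value) else value

# B1 plays the role of =SUM(A1:A4); B2 depends on B1 in turn.
cells["B1"] = lambda: sum(get(ref) for ref in ("A1", "A2", "A3", "A4"))
cells["B2"] = lambda: get("B1") * 2

print(get("B2"))  # evaluates the nested chain B2 -> B1 -> A1..A4, prints 200
```

Asking for B2 triggers exactly the kind of nested evaluation a spreadsheet performs when it recalculates.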
Doing such nested function calls is no different from how other programming languages work, only that you write the code in cells instead of text files. And if that didn’t convince you just yet: at the end of 2020, Microsoft announced the introduction of lambda functions, which allow you to write reusable functions in Excel’s own formula language, i.e., without having to rely on a different language like VBA. According to Brian Jones, Excel’s head of product, this was the missing piece that finally makes Excel a “real” programming language.[1] This also means that Excel users should really be called Excel programmers!

There is a special thing, though, about Excel programmers: most of them are business users or domain experts without a formal education in computer science. They are traders, accountants, or engineers, to mention just a few examples. Their spreadsheet tools are designed to solve a business problem and often ignore best practices in software development. As a consequence, these spreadsheet tools often mix inputs, calculations, and outputs on the same sheets, they may require nonobvious steps to be performed for them to work properly, and critical changes are done without any safety net. In other words, the spreadsheet tools are lacking a solid application architecture and are often undocumented and untested. Sometimes, these issues can have devastating consequences: if you forget to recalculate your trading workbook before placing a trade, you may buy or sell the wrong number of shares, which can cause you to lose money.

[2] James Vincent, “Scientists rename human genes to stop Microsoft Excel from misreading them as dates,” The Verge, August 6, 2020, https://oreil.ly/0qo-n.
[3] Leo Kelion, “Excel: Why using Microsoft’s tool caused COVID-19 results to be lost,” BBC News, October 5, 2020, https://oreil.ly/vvB6o.
[4] Wikipedia links to the document in one of the footnotes in their article about the case.
And if it isn’t just your own money you are trading, we can read about it in the news, as we’ll see next.

Excel in the News

Excel is a regular guest in the news, and during the course of this writing, two new stories hit the headlines. The first one was about the HUGO Gene Nomenclature Committee, which renamed a few human genes so they wouldn’t be interpreted by Excel as dates anymore. For example, to prevent the gene MARCH1 from being turned into 1-Mar, it was renamed to MARCHF1.[2] In the second story, Excel was blamed for the delayed reporting of 16,000 COVID-19 test results in England. The issue was caused by the test results being written to the older Excel file format (.xls), which is limited to roughly 65,000 rows. This meant that larger datasets were simply cut off beyond that limit.[3] While these two stories show the continued importance and dominance of Excel in today’s world, there is probably no other “Excel incident” that is more famous than the London Whale.

London Whale is the nickname of a trader whose trading mistakes forced JP Morgan to announce a staggering loss of $6 billion in 2012. The source of the blowup was an Excel-based value-at-risk model that was substantially underestimating the true risk of losing money in one of their portfolios. The Report of JPMorgan Chase & Co.
Management Task Force Regarding 2012 CIO Losses[4] (2013) mentions that “the model operated through a series of Excel spreadsheets, which had to be completed manually, by a process of copying and pasting data from one spreadsheet to another.” On top of these operational issues, they had a logical error: in one calculation, they were dividing by a sum instead of an average.

If you want to see more of these stories, have a look at Horror Stories, a web page maintained by the European Spreadsheet Risks Interest Group (EuSpRIG).

To prevent your company from ending up in the news with a similar story, let’s have a look at a few best practices next that make your work with Excel massively safer.

Programming Best Practices

This section will introduce you to the most important programming best practices, including separation of concerns, the DRY principle, testing, and version control. As we will see, following them will be easier when you start using Python together with Excel.

Separation of concerns

One of the most important design principles in programming is separation of concerns, sometimes also referred to as modularity. It means that a related set of functionality should be taken care of by an independent part of the program so it can be easily replaced without affecting the rest of the application. At the highest level, an application is often divided into the following layers:[5]

• Presentation layer
• Business layer
• Data layer

To explain these layers, consider a simple currency converter like the one shown in Figure 1-1.

[5] The terminology is taken from Microsoft Application Architecture Guide, 2nd Edition, which is available online.
You’ll find the currency_converter.xlsx Excel file in the xl folder of the companion repository.

This is how the application works: type in the Amount and Currency into cells A4 and B4, respectively, and Excel will convert this into US dollars in cell D4. Many spreadsheet applications follow such a design and are used by businesses every day. Let me break the application down into its layers:

Presentation layer
    This is what you see and interact with, i.e., the user interface: the values of cells A4, B4, and D4 together with their labels build the presentation layer of the currency converter.

Business layer
    This layer takes care of the application-specific logic: cell D4 defines how the amount is converted into USD. The formula =A4 * VLOOKUP(B4, F4:G11, 2, FALSE) translates to Amount times Exchange rate.

Data layer
    As the name suggests, this layer takes care of accessing the data: the VLOOKUP part of cell D4 is doing this job.

The data layer accesses the data from the exchange rates table that starts in cell F3 and that acts as the database of this little application. If you paid close attention, you probably noticed that cell D4 appears in all three layers: this simple application mixes the presentation, business, and data layers in a single cell.

Figure 1-1. currency_converter.xlsx

This isn’t necessarily an issue for this simple currency converter, but often, what starts off as a small Excel file turns soon enough into a much bigger application. How can this situation be improved? Most professional Excel developer resources advise you to use a separate sheet for each layer, in Excel’s terminology usually called inputs, calculations, and outputs. Often, this is combined with defining a certain color code for each layer, e.g., a blue background for all input cells.
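Separated into functions, the same converter could look like the following sketch. All names and rates here are made up for illustration; the book builds a real layered application in Chapter 11:

```python
# Data layer: looks up the exchange rate (the VLOOKUP part of cell D4).
EXCHANGE_RATES = {"EUR": 1.05, "GBP": 1.25, "CHF": 1.10}  # made-up rates

def get_rate(currency):
    return EXCHANGE_RATES[currency]

# Business layer: the conversion logic (cell D4's "Amount times Exchange rate").
def convert_to_usd(amount, currency):
    return amount * get_rate(currency)

# Presentation layer: formatting for the user (the role of the worksheet cells).
def show(amount, currency):
    return f"{amount} {currency} = {convert_to_usd(amount, currency):.2f} USD"

print(show(100, "EUR"))  # 100 EUR = 105.00 USD
```

Because each concern lives in its own function, you could swap out the data layer (say, to fetch live rates) without touching the conversion logic or the formatting.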
In Chapter 11, we will build a real application based on these layers: Excel will be the presentation layer, while the business and data layers are moved to Python, where it’s much easier to structure your code properly.

Now that you know what separation of concerns means, let’s find out what the DRY principle is!

DRY principle

The Pragmatic Programmer by Hunt and Thomas (Pearson Education) popularized the DRY principle: don’t repeat yourself. No duplicated code means fewer lines of code and fewer errors, which makes the code easier to maintain. If your business logic sits in your cell formulas, it’s practically impossible to apply the DRY principle, as there is no mechanism that allows you to reuse it in another workbook. This, unfortunately, means that a common way to start a new Excel project is to copy the workbook from the previous project or from a template.

If you write VBA, the most common piece of reusable code is a function. A function gives you access to the same code block from multiple macros, for example. If you have multiple functions that you use all the time, you might want to share them between workbooks. The standard instrument to share VBA code across workbooks is add-ins, but VBA add-ins lack a robust way of distributing and updating them. While Microsoft has introduced an Excel internal add-in store to solve that issue, this only works with JavaScript-based add-ins, so it’s not an option for VBA coders. This means that it is still very common to use the copy/paste approach with VBA: let’s assume that you need a cubic spline function in Excel. The cubic spline function is a way to interpolate a curve based on a few given points in a coordinate system and is often used by fixed income traders to derive an interest rate curve for all maturities based on a few known maturity/interest rate combinations. If you search for “Cubic Spline Excel” on the internet, it won’t take too long until you have a page of VBA code that does what you want.
The issue with this is that, most commonly, these functions were written by a single person, probably with good intentions but without formal documentation or testing. Maybe they work for the majority of inputs, but what about some uncommon edge cases? If you are trading a multimillion-dollar fixed-income portfolio, you want to have something you know you can trust. At least, that is what you will hear from your internal auditors when they find out where the code is coming from.

Python makes it easy to distribute code by using a package manager, as we will see in the last section of this chapter. Before we get there, however, let’s continue with testing, one of the cornerstones of solid software development.

Testing

When you tell an Excel developer to test their workbooks, they will most likely perform a few random checks: click a button and see if the macro still does what it is supposed to do, or change a few inputs and check if the output looks reasonable. This is, however, a risky strategy: Excel makes it easy to introduce errors that are hard to spot. For example, you can overwrite a formula with a hardcoded value. Or you forget to adjust a formula in a hidden column.

When you tell a professional software developer to test their code, they will write unit tests. As the name suggests, it’s a mechanism to test individual components of your program. For example, unit tests make sure that a single function of a program works properly. Most programming languages offer a way to run unit tests automatically. Running automated tests will increase the reliability of your codebase dramatically and make reasonably sure that you won’t break anything that currently works when you edit your code.

If you look at the currency conversion tool in Figure 1-1, you could write a test that checks if the formula in cell D4 correctly returns USD 105 with the following inputs: 100 EUR as amount and 1.05 as the EURUSD exchange rate.
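In Python, the test just described could be written with the built-in unittest module. This is a sketch only: convert is a hypothetical stand-in for whatever implements the logic of cell D4.

```python
import unittest

def convert(amount, rate):
    # Amount times exchange rate -- dividing here instead would be
    # exactly the bug the test is designed to catch.
    return amount * rate

class TestCurrencyConverter(unittest.TestCase):
    def test_eur_to_usd(self):
        # 100 EUR at a EURUSD rate of 1.05 must give 105 USD.
        self.assertAlmostEqual(convert(100, 1.05), 105.0)
```

Saved as, say, test_converter.py, the test runs with python -m unittest test_converter.py; if someone later rewrites convert to divide instead of multiply, the run fails immediately.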
Why does this help? Assume that you accidentally delete cell D4 with the conversion formula and have to rewrite it: instead of multiplying the amount by the exchange rate, you divide by it—after all, working with currencies can be confusing. When you run the above test, you will get a test failure, as 100 EUR / 1.05 will no longer result in the 105 USD that the test expects. Like this, you can detect and fix the formula before you hand the spreadsheet over to your users.

Pretty much all traditional programming languages offer one or more test frameworks to write unit tests without much effort—but not Excel. Fortunately, the concept of unit tests is simple enough, and by connecting Excel with Python, you get access to Python’s powerful unit testing frameworks. While a more in-depth presentation of unit tests is beyond the scope of this book, I invite you to have a look at my blog post, in which I walk you through the topic with practical examples.

Unit tests are often set up to run automatically when you commit your code to your version control system. The next section explains what version control systems are and why they are hard to use with Excel files.

Version control

Another characteristic of professional programmers is that they use a system for version control or source control. A version control system (VCS) tracks changes in your source code over time, allowing you to see who changed what, when, and why, and allows you to revert to old versions at any point in time. The most popular version control system nowadays is Git. It was originally created to manage the Linux source code and since then has conquered the programming world—even Microsoft adopted Git in 2017 to manage the Windows source code. In the Excel world, by contrast, the by far most popular version control system comes in the form of a folder where files are archived like this:

currency_converter_v1.xlsx
currency_converter_v2_2020_04_21.xlsx
currency_converter_final_edits_Bob.xlsx
currency_converter_final_final.xlsx

If, unlike in this sample, the Excel developer sticks to a certain convention in the file name, there’s nothing inherently wrong with that. But keeping a version history of your files locally locks you out of important aspects of source control in the form of easier collaboration, peer reviews, sign-off processes, and audit logs. And if you want to make your workbooks more secure and stable, you don’t want to miss out on these things. Most commonly, professional programmers use Git in connection with a web-based platform like GitHub, GitLab, Bitbucket, or Azure DevOps. These platforms allow you to work with so-called pull requests or merge requests, which let developers formally request that their changes be merged into the main codebase. A pull request offers the following information:

• Who is the author of the changes
• When were the changes made
• What is the purpose of the changes, as described in the commit message
• What are the details of the changes, as shown by the diff view, i.e., a view that highlights changes in green for new code and red for deleted code

This allows a coworker or a team head to review the changes and spot irregularities. Often, an extra pair of eyes will be able to spot a glitch or two or give otherwise valuable feedback to the programmer.
With all these advantages, why do Excel developers prefer to use the local file system and their own naming convention instead of a professional system like Git?

• Many Excel users simply don’t know about Git or give up early on, as Git has a relatively steep learning curve.
• Git allows multiple users to work on local copies of the same file in parallel. After all of them commit their work, Git can usually merge all the changes together without any manual intervention. This doesn’t work for Excel files: if they are being changed in parallel on separate copies, Git doesn’t know how to merge these changes back into a single file.
• Even if you manage to deal with the previous issues, Git simply doesn’t deliver as much value with Excel files as it does with text files: Git isn’t able to show changes between Excel files, preventing a proper peer review process.

Because of all these issues, my company has come up with xltrail, a Git-based version control system that knows how to deal with Excel files. It hides away the Git complexity so that business users feel comfortable using it, and it also allows you to connect to external Git systems, in case you are already tracking your files with GitHub, for example. xltrail tracks the different components of a workbook, including cell formulas, named ranges, Power Queries, and VBA code, allowing you to take advantage of the classic benefits of version control, including peer reviews.
Before we turn our attention to Python, I would like to briefly introduce Power Query and Power Pivot, Microsoft's attempt at modernizing Excel.

Modern Excel

The modern era of Excel started with Excel 2007, when the ribbon menu and the new file formats (e.g., xlsx instead of xls) were introduced. However, the Excel community uses modern Excel to refer to the tools that were added with Excel 2010: most importantly, Power Query and Power Pivot. They allow you to connect to external data sources and analyze data that is too big to fit into a spreadsheet. As their functionality overlaps with what we will do with pandas in Chapter 5, I will briefly introduce them in the first part of this section. The second part is about Power BI, which could be described as a standalone business intelligence application combining the functionality of Power Query and Power Pivot with visualization capabilities—and it has built-in support for Python!

Power Query and Power Pivot

With Excel 2010, Microsoft introduced an add-in called Power Query. Power Query connects to a multitude of data sources, including Excel workbooks, CSV files, and SQL databases. It also offers connections to platforms like Salesforce and can even be extended to connect with systems that aren't covered out of the box. Power Query's core functionality is dealing with datasets that are too big to fit into a spreadsheet. After loading the data, you may perform additional steps to clean and manipulate it so it arrives in a usable form in Excel. You could, for example, split a column into two, merge two tables, or filter and group your data. Since Excel 2016, Power Query is not an add-in anymore but can be accessed directly on the ribbon tab Data via the Get Data button.
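To give a flavor of the overlap with pandas that Chapter 5 explores in depth, the cleaning steps just mentioned (splitting a column into two, merging two tables, filtering and grouping) could be sketched like this. The tiny DataFrame and its column names are made up purely for illustration:

```python
import pandas as pd

# A small stand-in for data you would normally load from an external source
df = pd.DataFrame({"name": ["Ann Smith", "Bob Jones", "Ann Smith"],
                   "region": ["EU", "US", "EU"],
                   "sales": [100, 200, 50]})

# Split a column into two
df[["first_name", "last_name"]] = df["name"].str.split(" ", expand=True)

# Merge with a second table
managers = pd.DataFrame({"region": ["EU", "US"],
                         "manager": ["Carla", "Dave"]})
df = df.merge(managers, on="region")

# Filter, then group and aggregate
summary = df.loc[df["sales"] > 60].groupby("region")["sales"].sum()
```

Each line corresponds to a step you would otherwise click together in the Power Query editor, which is exactly why the two toolsets invite comparison.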
Power Query is only partially available on macOS—however, it is being actively developed, so it should be fully supported in a future release of Excel.

Power Pivot goes hand in hand with Power Query: conceptually, it's the second step after acquiring and cleaning your data with Power Query. Power Pivot helps you to analyze and present your data in an appealing way directly in Excel. Think of it as a traditional pivot table that, like Power Query, can deal with large datasets. Power Pivot allows you to define formal data models with relationships and hierarchies, and you can add calculated columns via the DAX formula language. Power Pivot was also introduced with Excel 2010 but remains an add-in and is so far not available on macOS.

If you like to work with Power Query and Power Pivot and want to build dashboards on top of them, Power BI may be worth a look—let's see why!

Power BI

Power BI is a standalone application that was released in 2015. It is Microsoft's answer to business intelligence tools like Tableau or Qlik. Power BI Desktop is free, so if you want to play around with it, go to the Power BI home page and download it—note, however, that Power BI Desktop is only available for Windows. Power BI wants to make sense of large datasets by visualizing them in interactive dashboards. At its core, it relies on the same Power Query and Power Pivot functionality as Excel. Commercial plans allow you to collaborate and share dashboards online, but these are separate from the desktop version. The main reason why Power BI is exciting in the context of this book is that it has been supporting Python scripts since 2018. Python can be used for the query part as well as the visualization part by making use of Python's plotting libraries.

6. You can learn more about how Instagram uses Python on their engineering blog.
To me, using Python in Power BI feels a bit clunky, but the important part here is that Microsoft has recognized the importance of Python with regard to data analysis. Accordingly, the hopes are high that one day Python will find an official way into Excel, too.

So what's so great about Python that it made it into Microsoft's Power BI? The next section has a few answers!

Python for Excel

Excel is all about storing, analyzing, and visualizing data. And since Python is particularly strong in the area of scientific computing, it's a natural fit in combination with Excel. Python is also one of the very few languages that is appealing both to the professional programmer and to the beginner who writes a few lines of code every few weeks. Professional programmers, on the one hand, like to work with Python because it is a general-purpose programming language and therefore allows you to achieve pretty much anything without jumping through hoops. Beginners, on the other hand, like Python because it's easier to learn than other languages. As a consequence, Python is used for ad hoc data analysis and smaller automation tasks as well as in huge production codebases like Instagram's backend.6 This also means that when your Python-powered Excel tool becomes really popular, it's easy to add a web developer to the project who will turn your Excel-Python prototype into a fully fledged web application. The unique advantage of Python is that the business logic most likely doesn't need to be rewritten but can be moved as-is from the Excel prototype to the production web environment.

In this section, I'll introduce Python's core concepts and compare them with Excel and VBA. I will touch on code readability, Python's standard library and package manager, the scientific computing stack, modern language features, and cross-platform compatibility.
Let's dive into readability first!

Readability and Maintainability

If your code is readable, it means that it is easy to follow and understand—especially for outsiders who haven't written the